Re: index large size file

2009-03-11 Thread Danil ŢORIN
The problem you may face with such large documents is that there is a high probability that most terms will be present in all documents. So on search you'll receive a lot of documents (if you need to retrieve the full text, it will take a while), but the bigger problem is usability: what a

Re: A model for predicting indexing memory costs?

2009-03-11 Thread Florian Weimer
* mark harwood: Could you get a heap dump (eg with YourKit) of what's using up all the memory when you hit OOM? On this particular machine I have a JRE, no admin rights and therefore limited profiling capability :( Maybe this could give you a heap dump which you can analyze on a different

Re: A model for predicting indexing memory costs?

2009-03-11 Thread mark harwood
Thanks, I have a heap dump now from a run with reduced JVM memory (in order to speed up a failure point) and am working through it offline with VisualVM. This test induced a proper OOM as opposed to one of those "timed out waiting for GC" type OOMs, so it may be misleading. The main culprit in this

Re: A model for predicting indexing memory costs?

2009-03-11 Thread Michael McCandless
mark harwood wrote: Thanks, I have a heap dump now from a run with reduced JVM memory (in order to speed up a failure point) and am working through it offline with VisualVM. This test induced a proper OOM as opposed to one of those "timed out waiting for GC" type OOMs so may be

Re: A model for predicting indexing memory costs?

2009-03-11 Thread Mark Miller
Michael McCandless wrote: Ie, it's still not clear if you are running out of memory vs hitting some weird "it's too hard for GC to deal" kind of massive heap fragmentation situation or something. It reminds me of the special (I cannot be played on record player X) record (your application)

Re: How to search both Tokenized and Untokenized fields

2009-03-11 Thread Erick Erickson
Well, PerFieldAnalyzerWrapper is just a bunch of Analyzers, independent of queries. See the API, but in general PerFieldAnalyzerWrapper perf = new PerFieldAnalyzerWrapper(new StandardAnalyzer()); perf.addAnalyzer("untokenized", new WhitespaceAnalyzer()); perf.addAnalyzer("tokenized", new SnowballAnalyzer("English"));

Re: A model for predicting indexing memory costs?

2009-03-11 Thread Michael McCandless
Mark Miller wrote: Michael McCandless wrote: Ie, it's still not clear if you are running out of memory vs hitting some weird "it's too hard for GC to deal" kind of massive heap fragmentation situation or something. It reminds me of the special (I cannot be played on record player X)

Re: A model for predicting indexing memory costs?

2009-03-11 Thread mark harwood
OK, it's early days and I'm holding my breath but I'm currently progressing further through my content without an OOM just by using a different GC setting. Thanks to advice here and colleagues at work I've gone with a GC setting of -XX:+UseSerialGC for this indexing task. The rationale that

Integer2String Covnersation

2009-03-11 Thread Allahbaksh Mohammedali Asadullah
Hi all, Can anyone explain how the function integer2String works? public static int int2sortableStr(int val, char[] out, int offset) { val += Integer.MIN_VALUE; out[offset++] = (char)(val >>> 24); out[offset++] = (char)((val >>> 12) & 0x0fff); out[offset++] = (char)(val & 0x0fff);

Re: Integer2String Covnersation

2009-03-11 Thread Yonik Seeley
On Wed, Mar 11, 2009 at 9:54 AM, Allahbaksh Mohammedali Asadullah allahbaksh_asadul...@infosys.com wrote: Hi all, Can any one explain How function integer2String works.  public static int int2sortableStr(int val, char[] out, int offset) {    val += Integer.MIN_VALUE; This maps MIN_VALUE to

search problem when indexed using Field.setOmitTf()

2009-03-11 Thread Siraj Haider
We are having a problem running searches on an index after upgrading to 2.4 and using the new Field.setOmitTf() function. The index size has been dramatically reduced and even the search performance is better. But searches do not return any results if searching for something that has a space

RE: Integer2String Covnersation

2009-03-11 Thread Allahbaksh Mohammedali Asadullah
Hi, I didn't get what exactly shifting 24 times and shifting 12 times does. Is there any character at that value, or is there some differentiator? Can someone go into a bit more detail? Regards. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik

Re: Integer2String Covnersation

2009-03-11 Thread Yonik Seeley
On Wed, Mar 11, 2009 at 10:25 AM, Allahbaksh Mohammedali Asadullah allahbaksh_asadul...@infosys.com wrote: Hi, I didn't get what exactly shifting 24 times and shifting 12 times does. Is there any character at that value, or is there some differentiator? Can someone go into a bit more detail?
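To make the shifting concrete, here is a minimal, self-contained sketch of the function under discussion. It assumes the operators lost in the quoted code are an unsigned right shift (>>>) and a bitwise AND (&), and that the function returns the number of chars written (3); the class name and the encode helper are hypothetical, added only for the demonstration. Adding Integer.MIN_VALUE flips the sign bit, so the signed range maps onto 0..2^32-1 in order; the 32 bits are then split into chunks of 8, 12 and 12 bits, one per char, so comparing the resulting strings char by char matches comparing the original ints.

```java
public class SortableIntDemo {

    // Sketch of int2sortableStr as quoted in the thread.
    // ASSUMPTION: the stripped operators are ">>>" and "&", and the return value is 3.
    public static int int2sortableStr(int val, char[] out, int offset) {
        val += Integer.MIN_VALUE;                        // flip the sign bit: MIN_VALUE -> 0, MAX_VALUE -> 0xFFFFFFFF
        out[offset++] = (char) (val >>> 24);             // top 8 bits
        out[offset++] = (char) ((val >>> 12) & 0x0fff);  // middle 12 bits
        out[offset++] = (char) (val & 0x0fff);           // low 12 bits
        return 3;                                        // number of chars written
    }

    // Hypothetical helper: encode one int as a 3-char String.
    public static String encode(int val) {
        char[] buf = new char[3];
        int2sortableStr(val, buf, 0);
        return new String(buf);
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -5, -1, 0, 1, 42, Integer.MAX_VALUE};
        // Each encoded value should sort before the next one, just like the ints do.
        for (int i = 1; i < samples.length; i++) {
            boolean ordered = encode(samples[i - 1]).compareTo(encode(samples[i])) < 0;
            System.out.println(samples[i - 1] + " < " + samples[i] + " as strings: " + ordered);
        }
    }
}
```

Every char stays below 0x1000, so none of them falls in the surrogate range and plain String comparison is safe.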

Re: Questions about analyzer

2009-03-11 Thread Michael McCandless
Ganesh wrote: Mike, in one of his replies to the thread "Faceted search using Lucene", gave the following code review comment: * You are creating a new Analyzer & QueryParser every time, also creating unnecessary garbage; instead, they should be created once & reused. This made me ask the below

Re: Lucene 2.9

2009-03-11 Thread Michael McCandless
Allahbaksh Mohammedali Asadullah wrote: For example I want to search amount = 15 rather than doing it amount:[ 15] or something? Is there any open source queryparser which converts something like amount =15 into lucene number format query. I don't know of any effort to change Lucene's

Re: Lucene 2.9

2009-03-11 Thread Mark Miller
Hmmm - you can probably get qsol to do it: http://myhardshadow.com/qsol. I think you can set up any token to expand to anything with a regex matcher and use group capturing in the replacement (I don't fully remember though, it's been a while since I've used it). So you could do a regex of something
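The regex-with-group-capturing idea can be sketched with plain java.util.regex. This is not qsol's API; the class name, the pattern, and the choice to rewrite "field = N" into the degenerate range field:[N TO N] are all assumptions made for illustration, since the exact operator and target syntax in the original question did not survive.

```java
import java.util.regex.Pattern;

public class RangeRewriteDemo {

    // HYPOTHETICAL pattern: a field name, "=", then an integer, with optional spaces.
    // Group 1 captures the field name, group 2 the number.
    private static final Pattern EQUALS = Pattern.compile("(\\w+)\\s*=\\s*(\\d+)");

    // Rewrites every "field = N" into Lucene's range syntax "field:[N TO N]".
    public static String rewrite(String query) {
        return EQUALS.matcher(query).replaceAll("$1:[$2 TO $2]");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("amount = 15"));
        System.out.println(rewrite("title:foo AND amount =15"));
    }
}
```

The rewritten string would then be handed to the normal query parser; how the numbers are indexed so that the range matches is a separate question that this sketch does not address.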

Re: Lucene 2.9

2009-03-11 Thread Michael McCandless
Yonik Seeley wrote: On Mon, Mar 9, 2009 at 2:02 PM, Michael McCandless luc...@mikemccandless.com wrote: Once added, something inside the index (a write once schema) records that this field is an IntField and then it's an error to ever use a different type field by that same name. I

Memory during Indexing

2009-03-11 Thread Niels Ott
Hi Lucene professionals! This may sound like a dumb beginner's question, but anyways: Can Lucene run out of memory during indexing? Should I use IndexWriter.flush() or .commit(), and if so, how often? Thank you for your support. Niels -- Niels Ott Computational Linguist (B.A.)

Re: Lucene Highlighting and Dynamic Summaries

2009-03-11 Thread markharw00d
If you can supply a JUnit test that recreates the problem I think we can start to make progress on this. Amin Mohammed-Coleman wrote: Hi, Apologies for resending this mail. Just wondering if anyone has experienced the below. I'm not sure if this could happen due to the nature of the document. It

Re: Memory during Indexing

2009-03-11 Thread markharw00d
Hi Niels, See the javadocs for IndexWriter.setRAMBufferSizeMB() Cheers Mark Niels Ott wrote: Hi Lucene professionals! This may sound like a dumb beginner's question, but anyways: Can Lucene run out of memory during indexing? Should I use IndexWriter.flush() or .commit(), and if so, how

Re: search problem when indexed using Field.setOmitTf()

2009-03-11 Thread Yonik Seeley
On Wed, Mar 11, 2009 at 2:35 PM, Michael McCandless luc...@mikemccandless.com wrote: This is expected: phrase searches will not work when you omitTf. But why would a phrase query be created? The code given looks like it should create a boolean query with two terms. Of course, the given code

Re: search problem when indexed using Field.setOmitTf()

2009-03-11 Thread Siraj Haider
Yonik Seeley wrote: On Wed, Mar 11, 2009 at 2:35 PM, Michael McCandless luc...@mikemccandless.com wrote: This is expected: phrase searches will not work when you omitTf. But why would a phrase query be created? The code given looks like it should create a boolean query with two

Re: search problem when indexed using Field.setOmitTf()

2009-03-11 Thread Michael McCandless
Siraj Haider wrote: Yonik Seeley wrote: On Wed, Mar 11, 2009 at 2:35 PM, Michael McCandless luc...@mikemccandless.com wrote: This is expected: phrase searches will not work when you omitTf. But why would a phrase query be created? The code given looks like it should create a boolean

Re: sloppyFreq question

2009-03-11 Thread Chris Hostetter
: For a 'SpanNearQuery', this reduces the effect of the term frequency on the : score as the number of terms in the span increases. So, for a simple phrase : query (using spans), the longer the phrase, the lower the TF. For a simple : SpanTermQuery, the TF is reduced in half (1.0f / (1 + 1)). : :

Re: sloppyFreq question

2009-03-11 Thread Chris Hostetter
: For a SpanNearQuery that contains SpanTermQueries, the score for a match on : "the quick brown fox" would be lower than a match on "brown fox" because of : the edit distance (4 vs 2). This seems counterintuitive, too. you have to clarify what you mean ... if you're talking about a SpanNearQuery
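For reference, the numbers in this thread can be reproduced with a few lines of Java. The sketch assumes the default sloppyFreq of 1 / (distance + 1), which is what the figures quoted above imply; the class is a stand-alone illustration and does not call Lucene.

```java
public class SloppyFreqDemo {

    // ASSUMPTION: mirrors the default Similarity's sloppyFreq, 1 / (distance + 1).
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // A single SpanTermQuery match (span length 1): the contribution is halved.
        System.out.println("distance 1: " + sloppyFreq(1));
        // A two-term span such as "brown fox".
        System.out.println("distance 2: " + sloppyFreq(2));
        // A four-term span such as "the quick brown fox": a smaller contribution still.
        System.out.println("distance 4: " + sloppyFreq(4));
    }
}
```

Under that formula the longer span always contributes less per match, which is the effect the original poster found counterintuitive.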

Re: index large size file

2009-03-11 Thread Chris Hostetter
: Subject: index large size file : In-Reply-To: 49b5fc5e.10...@r.email.ne.jp http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if

Re: sloppyFreq question

2009-03-11 Thread Peter Keegan
I suppose SpanTermQuery could override the weight/scorer methods so that it behaved more like a TermQuery if it was executed directly ... but that's really not what it's intended for. This is currently the only way to boost a term via payloads. BoostingTermQuery extends SpanTermQuery. if

Re: Memory during Indexing

2009-03-11 Thread Niels Ott
Hi Mark, markharw00d wrote: Hi Niels, See the javadocs for IndexWriter.setRAMBufferSizeMB() I tried different settings. Apart from the fact that my memory issue seems to be my own fault, I'm wondering what Lucene does in the background. Apparently it does flush(), but not commit()? At

RE: How to search both Tokenized and Untokenized fields

2009-03-11 Thread Fang_Li
Hi, What do you mean by an untokenized field? Are you using a different analyzer for each field? If yes, I think you should just use the same analyzer (PerFieldAnalyzerWrapper, I guess) for the query. Li -Original Message- From: rokham [mailto:somebodyik...@gmail.com] Sent: Monday, March 09, 2009 11:02 PM