SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
I've noticed that after stress-testing my application (uses Lucene 2.0) for a while, I have almost 200mb of byte[]s hanging around, the top two culprits being: 24 x SegmentReader.Norm.bytes = 112mb; 2 x SegmentReader.ones = 16mb. The second one isn't a big deal, but I wonder what's the

Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
Yonik Seeley wrote: On 12/11/06, Eric Jain wrote: I've noticed that after stress-testing my application (uses Lucene 2.0) for a while, I have almost 200mb of byte[]s hanging around, the top two culprits being: 24 x SegmentReader.Norm.bytes = 112mb; 2 x SegmentReader.ones

Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
Yonik Seeley wrote: It's read on demand, per indexed field. So assuming your index is optimized (a single segment), then it increases by one byte[] each time you search on a new field. OK, makes sense then. Thanks! - To
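A rough way to reason about the numbers in this thread: norms are held as one byte per document for each indexed field whose norms have been loaded, so the heap cost grows with the document count times the number of fields searched so far. The sketch below is plain arithmetic rather than Lucene code; the class name, method name and the document count used in `main` are made up for illustration.

```java
// Back-of-the-envelope estimate of heap held by loaded norms arrays.
// Assumption: one byte per document per field whose norms were loaded.
class NormsEstimate {
    static long normsBytes(long maxDoc, int fieldsSearched) {
        return maxDoc * fieldsSearched;
    }

    public static void main(String[] args) {
        long maxDoc = 5_000_000L; // hypothetical index size, not from the thread
        for (int fields = 1; fields <= 16; fields *= 2) {
            System.out.println(fields + " field(s): "
                    + normsBytes(maxDoc, fields) + " bytes");
        }
    }
}
```

With several segments or several open readers the same per-field cost is paid once per segment reader, which may be why the count of arrays can exceed the number of fields.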

Re: Avoiding ParseExceptions

2006-06-06 Thread Eric Jain
Chris Nokleberg wrote: I am using the QueryParser with a StandardAnalyzer. I would like to avoid or auto-correct anything that would lead to a ParseException. For example, I don't think you can get a parse exception from Google--even if you omit a closing quote it looks like it just closes it

Re: IndexUpdateListener

2006-05-15 Thread Eric Jain
Chris Hostetter wrote: The only useful callback/listener abstractions I can think of are when you want to know if someone has finished with a set of changes -- whether that change is adding one document, deleting one document, or adding/deleting a whole bunch of documents isn't really relevant,

Re: Preventing phrase queries from matching across lines

2006-04-29 Thread Eric Jain
Erik Hatcher wrote: On Apr 28, 2006, at 5:35 AM, Eric Jain wrote: What is the best way to prevent a phrase query such as "eggs white" matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. newline. 2. Have an analyzer

Preventing phrase queries from matching across lines

2006-04-28 Thread Eric Jain
What is the best way to prevent a phrase query such as "eggs white" matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. newline. 2. Have an analyzer somehow increment the position of a term for each line break it
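The second possibility can be illustrated without Lucene: give every token a position, add a large gap whenever a line break is crossed, and let an exact phrase match only on consecutive positions. The sketch below is a self-contained model of that idea; the class, the whitespace tokenizer and the `LINE_GAP` value are assumptions made for illustration. In Lucene the equivalent would presumably be done in a token filter via `Token.setPositionIncrement`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of position gaps at line breaks (not Lucene code).
class LineGapPositions {
    // Hypothetical gap; it only needs to exceed any phrase slop in use.
    static final int LINE_GAP = 100;

    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Splits on whitespace, lower-cases, and skips ahead at each line break.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String line : text.split("\n")) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                tokens.add(new Token(word.toLowerCase(), position++));
            }
            position += LINE_GAP; // the next line starts far away
        }
        return tokens;
    }

    // True if the phrase terms occur at strictly consecutive positions.
    static boolean matchesPhrase(List<Token> tokens, String... phrase) {
        for (int i = 0; i + phrase.length <= tokens.size(); i++) {
            boolean ok = true;
            int start = tokens.get(i).position;
            for (int j = 0; j < phrase.length && ok; j++) {
                Token t = tokens.get(i + j);
                ok = t.term.equals(phrase[j]) && t.position == start + j;
            }
            if (ok) return true;
        }
        return false;
    }
}
```

Here "eggs white" no longer matches "fried eggs\nwhite snow", while it still matches the same words on a single line.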

Re: Lucene Performance Issues

2006-03-28 Thread Eric Jain
thomasg wrote: 1) By default, Lucene only indexes the first 10,000 words from each document. When increasing this default out-of-memory errors can occur. This implies that documents, or large sections thereof, are loaded into memory. ISYS has a very small memory footprint which is not affected

Re: Appending * to each search term

2006-03-17 Thread Eric Jain
Florian Hanke wrote: I'd like to append an * (create a WildcardQuery) to each search term in a query, such that a query that is entered as e.g. term1 AND term2 is modified (effectively) to term1* AND term2*. Parsing the search string is not very elegant (of course). I'm thinking that

Re: speed

2006-03-10 Thread Eric Jain
[EMAIL PROTECTED] wrote: When I make search I get count = 37. Maybe I am doing something incorrectly? I assume you ran both variants repeatedly, in the same process (start up costs etc)?

Re: Get only count

2006-03-07 Thread Eric Jain
Anton Potehin wrote: Now I create a new search to get the number of results. For example: IndexSearcher is = ... Query q = ... numberOfResults = is.search(q).length(); Can I accelerate this example? And how? Perhaps something like: class CountingHitCollector extends HitCollector {
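The snippet above is cut off, so here is one way the counting collector could be completed. To keep the sketch compilable on its own, `HitCollector` is declared locally as a stand-in with the single `collect(int doc, float score)` callback that Lucene 2.0's class of the same name is understood to have; the accessor name `getCount` is an assumption.

```java
// Stand-in for Lucene's HitCollector so this sketch compiles on its own.
// Assumption: the real class exposes the same single abstract callback.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

// Counts matching documents without building or sorting a result list.
class CountingHitCollector extends HitCollector {
    private int count;

    @Override
    public void collect(int doc, float score) {
        count++; // called once per matching document
    }

    public int getCount() {
        return count;
    }
}
```

With the real library, the local stand-in would be dropped and the collector passed to something like `searcher.search(query, collector)` before reading `getCount()`.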

Re: sub search

2006-03-07 Thread Eric Jain
Anton Potehin wrote: After it I want to not make a new search, I want to make search among found results... Perhaps something like this would work: final BitSet results = toBitSet(Hits); searcher.search(newQuery, new Filter() { public BitSet bits(IndexReader reader) { return results;
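The idea behind the filter suggestion is set intersection: the first search's matches are kept as a bit set indexed by document number, and a later query may only match documents whose bit is set. The following self-contained sketch models that with `java.util.BitSet` alone; `SubSearch`, `toBitSet` and `searchWithin` are hypothetical helpers, and the document numbers are invented.

```java
import java.util.BitSet;

// Models "search within results" as an intersection of document-id sets.
class SubSearch {
    // Builds a bit set with one bit per matching document number.
    static BitSet toBitSet(int[] docIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Keeps only those new matches that were also in the previous results.
    static BitSet searchWithin(BitSet previousResults, BitSet newMatches) {
        BitSet narrowed = (BitSet) newMatches.clone(); // leave inputs untouched
        narrowed.and(previousResults);
        return narrowed;
    }
}
```

Cloning before `and` matters because `BitSet.and` modifies its receiver; a cached result set that is reused across sub-searches should not be mutated.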

Re: MultiPhraseQuery

2006-03-06 Thread Eric Jain
Daniel Naber wrote: Please try to add this to MultiPhraseQuery and let us know if it helps: public List getTerms() { return termArrays; } That is indeed all I need (the list wouldn't have to be mutable though). Any chance this could be committed? Incidentally, would be helpful if

MultiPhraseQuery

2006-03-05 Thread Eric Jain
I need to write a function that copies a MultiPhraseQuery and changes the field the query applies to. Unfortunately the API allows access to neither the contained terms nor the field! The other query classes I have so far dealt with all seem to allow access to the contained query terms...

QueryParser dropping constraints?

2006-03-05 Thread Eric Jain
I've noticed that while the QueryParser (both the default QueryParser and the PrecedenceQueryParser) refuse to parse "foo bar) baz", they both seem to interpret "foo bar( baz" as "foo bar". Bug or feature? In any case, would be great if there was a strict mode, and a more lenient mode

Re: Indexing performance with Lucene 1.9

2006-03-01 Thread Eric Jain
Eric Jain wrote: I'll rerun the indexing procedure with the old version overnight, just to be sure. Just to confirm: There no longer seems to be any difference in indexing performance between the nightly build and 1.4.3

Re: Solr, the Lucene based Search Server

2006-03-01 Thread Eric Jain
Yonik Seeley wrote: Solr is a new open-source search server that's based on Lucene, and has XML/HTTP interfaces for updating and querying, declarative specification of analyzers and field types via a schema, extensive caching, replication, and a web admin interface. Just had a look, quite

Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain
Daniel Naber wrote: A fix has now been committed to trunk in SVN, it should be part of the next 1.9 release. Performance seems to have recovered, more or less, thanks!

Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain
Otis Gospodnetic wrote: Regarding performance fix - if you can be more precise (is it really just more or less or is it as good as before), that would be great for those of us itching to use 1.9. Yes, I can confirm that performance differs by no more than 3.1 fraggles. ;-)

Re: Frequency of phrase

2006-02-25 Thread Eric Jain
Doug Cutting wrote: If you use a span query then you can get the actual number of phrase instances. Thanks, good to know! In this case (need to suggest phrase queries to the user) I've now settled with dividing the number of hits for a potential phrase by the number of documents that

Re: Frequency of phrase

2006-02-24 Thread Eric Jain
Dave Kor wrote: Not sure if this is what you want, but what I have done is to issue exact phrase queries to Lucene and counted the number of hits found. This gives you the number of documents containing the phrase, rather than the number of occurrences of the phrase itself, but that may in

Frequency of phrase

2006-02-23 Thread Eric Jain
This is somewhat related to a question sent to this list a while ago: Is there an efficient way to count the number of occurrences of a phrase (not term) in an index?

QueryPrinter?

2006-02-18 Thread Eric Jain
I need to parse a query string, modify it a bit, and then output the modified query string. This works quite well with query.toString(), except that when I parse the query I set DEFAULT_OPERATOR_AND, and the output of BooleanQuery.toString() assumes DEFAULT_OPERATOR_OR... Would be great if

Re: Generating phrase queries from term queries

2006-01-12 Thread Eric Jain
Chris Hostetter wrote: (Assuming *I* understand it) what he's talking about is the ability for his search GUI to display suggested phrase searches you may want to try which consist of the words you just typed in grouped into phrases. Yes, that's precisely what I am talking about. Sorry for

Re: Generating phrase queries from term queries

2006-01-11 Thread Eric Jain
Paul Elschot wrote: One way that might be better is to provide your own Scorer that works on the term positions of the three or more terms. This would be better for performance because it only uses one term positions object per query term (a, b, and c here). I'm trying to extract the actual

Re: Scoring by number of terms in field

2006-01-10 Thread Eric Jain
Paul Elschot wrote: In case you prefer to use the maximum score over the clauses you can use the DisjunctionMaxQuery from the development version. Yes, that may help! I'll need to have a look...

Generating phrase queries from term queries

2006-01-10 Thread Eric Jain
Is there an efficient way to determine if two or more terms frequently appear next to each other in sequence? For a query like: a b c one or more of the following suggestions could be generated: a b c a b c a b c I could of course just run a search with all possible combinations, but perhaps

Scoring by number of terms in field

2006-01-09 Thread Eric Jain
Lucene seems to prefer matches in shorter documents. Is it possible to influence the scoring mechanism to have matches in shorter fields score higher instead? For example, a query for europe should rank: 1. title:Europe 2. title:History of Europe 3. title:Travel in Europe, Middle East and

Re: Scoring by number of terms in field

2006-01-09 Thread Eric Jain
Paul Elschot wrote: For example, a query for europe should rank: 1. title:Europe 2. title:History of Europe 3. title:Travel in Europe, Middle East and Africa 4. subtitle:Fairy Tales from Europe Perhaps with this query (assuming the default implicit OR): title:europe subtitle:europe^0.5