SegmentReader using too much memory?
I've noticed that after stress-testing my application (uses Lucene 2.0) for a while, I have almost 200mb of byte[]s hanging around, the top two culprits being:

24 x SegmentReader.Norm.bytes = 112mb
2 x SegmentReader.ones = 16mb

The second one isn't a big deal, but I wonder what the explanation for the first one is? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: SegmentReader using too much memory?
Yonik Seeley wrote: On 12/11/06, Eric Jain [EMAIL PROTECTED] wrote: I've noticed that after stress-testing my application (uses Lucene 2.0) for a while, I have almost 200mb of byte[]s hanging around, the top two culprits being: 24 x SegmentReader.Norm.bytes = 112mb, 2 x SegmentReader.ones = 16mb. Each indexed field has a norm array that is the product of its index-time boost and the length normalization factor. If you don't need either, you can omit the norms (as it looks like you already have on some fields, given that ones is the fake norms used in place of the real norms). Thanks for the explanation. Not sure where the fields without norms come from: I use neither Field.setOmitNorms nor Index.NO_NORMS anywhere! I do want to use document boosting... Is that independent of field boosting? The length normalization, on the other hand, may not be necessary.
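For reference, the arithmetic behind those figures can be sketched: each loaded norm array holds one byte per document, so 24 loaded field norms over an index of roughly 4.9M documents would add up to about 112mb. The document count here is a hypothetical figure inferred from the numbers above, not something from the actual index:

```java
public class NormMemory {
    // One norm byte per document, per indexed field whose norms are loaded.
    static long normBytes(long numDocs, int loadedFields) {
        return numDocs * loadedFields;
    }

    public static void main(String[] args) {
        long numDocs = 4900000L; // hypothetical document count
        int fields = 24;         // loaded norm arrays, from the figures above
        System.out.println(normBytes(numDocs, fields) / (1024 * 1024) + " MB"); // prints "112 MB"
    }
}
```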
Re: SegmentReader using too much memory?
Yonik Seeley wrote: It's read on demand, per indexed field. So assuming your index is optimized (a single segment), then it increases by one byte[] each time you search on a new field. OK, makes sense then. Thanks!
Re: Avoiding ParseExceptions
Chris Nokleberg wrote: I am using the QueryParser with a StandardAnalyzer. I would like to avoid or auto-correct anything that would lead to a ParseException. For example, I don't think you can get a parse exception from Google--even if you omit a closing quote, it looks like it just closes it for you (please correct me if you know otherwise). Would definitely be nice if the QueryParser had both a strict and a lenient mode. If you used the latter, it would of course be wise to reflect the actual executed query back to the user, so it's clear what's going on.
Re: IndexUpdateListener
Chris Hostetter wrote: The only useful callback/listener abstractions I can think of are when you want to know if someone has finished with a set of changes -- whether that change is adding one document, deleting one document, or adding/deleting a whole bunch of documents isn't really relevant; you still want to know that a complete set has been modified, so you aren't constantly flushing caches or reopening IndexReaders every time a single document is added. Speaking of listeners: It would be great if there was a way to know when optimize() changes a document ID. Storing document IDs externally is the only way to merge Lucene queries with queries in a relational database efficiently (as far as I know), but the inability to track document ID changes complicates things a bit...
Re: Preventing phrase queries from matching across lines
Erik Hatcher wrote: On Apr 28, 2006, at 5:35 AM, Eric Jain wrote: What is the best way to prevent a phrase query such as "eggs white" from matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. newline. 2. Have an analyzer somehow increment the position of a term for each line break it encounters. The latter seems a bit more complicated to implement, but it would also be more efficient, right? Or are there better options? #2 shouldn't be too hard to implement, but you'll need to catch new lines in the initial tokenizer. I'm not sure about the efficiency; both options would require a tokenizer detecting new lines and either injecting a new term or setting a flag such that the next term gets a position increment bump. Thanks, #2 turned out to be easier to implement than expected. I should have clarified that the efficiency I was concerned about was not the efficiency of the tokenization, but the impact of having all those additional newline term positions in the index.
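The idea behind option #2 can be sketched without any Lucene classes: give each term a position, add a gap at every line break, and only let a phrase match when the terms sit at adjacent positions. The Posting class and the GAP value below are illustrative, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapTokenizer {
    static final int GAP = 100; // position increment added per line break (illustrative)

    static class Posting {
        final String term;
        final int position;
        Posting(String term, int position) { this.term = term; this.position = position; }
    }

    // Assign consecutive positions within a line, and bump the position
    // counter by GAP at each line break.
    static List<Posting> tokenize(String text) {
        List<Posting> postings = new ArrayList<Posting>();
        int pos = 0;
        for (String line : text.split("\n")) {
            for (String term : line.trim().split("\\s+")) {
                if (term.length() > 0) postings.add(new Posting(term, pos++));
            }
            pos += GAP;
        }
        return postings;
    }

    // A two-term phrase matches only if the terms occupy adjacent positions.
    static boolean phraseMatches(List<Posting> postings, String a, String b) {
        for (Posting p : postings) {
            if (!p.term.equals(a)) continue;
            for (Posting q : postings) {
                if (q.term.equals(b) && q.position == p.position + 1) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Posting> doc = tokenize("fried eggs\nwhite snow");
        System.out.println(phraseMatches(doc, "eggs", "white")); // false: gap at line break
        System.out.println(phraseMatches(doc, "fried", "eggs")); // true: same line, adjacent
    }
}
```

In Lucene itself, the gap would be applied via the position increment of the first token after a line break.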
Preventing phrase queries from matching across lines
What is the best way to prevent a phrase query such as "eggs white" from matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. newline. 2. Have an analyzer somehow increment the position of a term for each line break it encounters. The latter seems a bit more complicated to implement, but it would also be more efficient, right? Or are there better options?
Re: Lucene Performance Issues
thomasg wrote: 1) By default, Lucene only indexes the first 10,000 words from each document. When increasing this default, out-of-memory errors can occur. This implies that documents, or large sections thereof, are loaded into memory. ISYS has a very small memory footprint which is not affected by document size nor number of documents. As far as I know, documents do indeed have to be built in memory prior to indexing. But this shouldn't be a problem unless you have only a few megabytes of memory, or you have documents that are hundreds of megabytes in size -- and such large documents should probably be split, anyway. 2) Lucene appears to be slow at indexing, at least by ISYS' standards. Published performance benchmarks seem to vary between almost acceptable down to very poor. ISYS' file readers are already optimized for the fastest text extraction possible. Indexing performance is my main concern with Lucene, though there are several parameters that can be tuned, and I haven't exhausted all of them yet. Currently I am using:

writer.setMergeFactor(100);
writer.setMaxBufferedDocs(100);
writer.setUseCompoundFile(false);

This allows me to build a 3GB index with about 3M documents in 6h on a 2x2GHz Intel Xeon machine with 1GB of memory and a reasonably fast hard disk. There is some other stuff going on besides the indexing, but the indexing does seem to take up the greatest amount of time. Note that Lucene also supports incremental updates. 3) The Lucene documentation suggests it can be slow at searching and can get slower and slower the larger your indexes get. The tipping point is where the index size exceeds the amount of free memory in your machine. This also implies that whole indexes, or large portions of them, are loaded into memory. The bigger the index, the more powerful the machine required. ISYS' search speed is always proportional to the size of the result set. Index size does not materially affect search speed, and the index is never loaded into memory. It also appears that Lucene requires hands-on tuning to keep its search speed acceptable. ISYS' indexes are self-managing and do not require any maintenance to keep them searchable at full speed. Queries on the index mentioned above return results within a few milliseconds, with less than 256MB used by the VM, though some complex queries that contain a lot of frequent terms may take up to several seconds. I'm not sure how Lucene's searching performance can be tuned, but I haven't bothered to do so, as it hasn't been a bottleneck so far...
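For what it's worth, the throughput implied by the figures above (3M documents in 6 hours, as reported, with other work going on besides indexing) works out like this:

```java
public class IndexingThroughput {
    public static void main(String[] args) {
        long docs = 3000000L;      // ~3M documents, from the message above
        long seconds = 6L * 3600L; // 6 hours
        System.out.println(docs / seconds + " docs/sec"); // roughly 138 docs/sec
    }
}
```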
Re: Appending * to each search term
Florian Hanke wrote: I'd like to append an * (create a WildcardQuery) to each search term in a query, such that a query that is entered as e.g. term1 AND term2 is modified (effectively) to term1* AND term2*. Parsing the search string is not very elegant (of course). I'm thinking that overriding QueryParser#get(Boolean etc.)Query is the way to go, the way it's designed. But still, extracting terms and injecting them back in while operating on specific Query classes does not seem the way to go. Can anyone perhaps suggest a nice alternative? Perhaps you could subclass the QueryParser and override the getFieldQuery method:

protected Query getFieldQuery(String field, String term) {
    return new PrefixQuery(new Term(field, term));
}
Re: speed
[EMAIL PROTECTED] wrote: When I make search I get count = 37. May be I do something not correctly? I assume you ran both variants repeatedly, in the same process (start-up costs etc.)?
Re: Get only count
Anton Potehin wrote: Now I create a new search to get the number of results. For example: IndexSearcher is = ... Query q = ... numberOfResults = is.search(q).length(); Can I accelerate this example? And how? Perhaps something like:

class CountingHitCollector extends HitCollector {
    public int count;
    public void collect(int doc, float score) {
        if (score > 0.0f) ++count;
    }
}

...
CountingHitCollector c = new CountingHitCollector();
searcher.search(query, c);
int hits = c.count;
Re: sub search
Anton Potehin wrote: After it I want to not make a new search, I want to make search among found results... Perhaps something like this would work:

final BitSet results = toBitSet(hits);
searcher.search(newQuery, new Filter() {
    public BitSet bits(IndexReader reader) {
        return results;
    }
});
Re: MultiPhraseQuery
Daniel Naber wrote: Please try to add this to MultiPhraseQuery and let us know if it helps:

public List getTerms() {
    return termArrays;
}

That is indeed all I need (the list wouldn't have to be mutable, though). Any chance this could be committed? Incidentally, it would be helpful if the PrecedenceQueryParser instantiated MultiPhraseQueries via a call to an (overridable) getMultiPhraseQuery method.
MultiPhraseQuery
I need to write a function that copies a MultiPhraseQuery and changes the field the query applies to. Unfortunately the API allows access to neither the contained terms nor the field! The other query classes I have so far dealt with all seem to allow access to the contained query terms...
QueryParser dropping constraints?
I've noticed that while the QueryParser (both the default QueryParser and the PrecedenceQueryParser) refuses to parse "foo bar) baz", they both seem to interpret "foo bar( baz" as "foo bar". Bug or feature? In any case, it would be great if there were a strict mode, and a more lenient mode where incorrect syntax is ignored (as far as possible).
Re: Indexing performance with Lucene 1.9
Eric Jain wrote: I'll rerun the indexing procedure with the old version overnight, just to be sure. Just to confirm: There no longer seems to be any difference in indexing performance between the nightly build and 1.4.3.
Re: Solr, the Lucene based Search Server
Yonik Seeley wrote: Solr is a new open-source search server that's based on Lucene, and has XML/HTTP interfaces for updating and querying, declarative specification of analyzers and field types via a schema, extensive caching, replication, and a web admin interface. Just had a look, quite impressive. I noticed that you have a WordDelimiterFilter; any chance that this will be contributed back to Lucene? This class is really useful! (In fact I was just trying to write something similar myself...)
Re: Indexing performance with Lucene 1.9
Daniel Naber wrote: A fix has now been committed to trunk in SVN, it should be part of the next 1.9 release. Performance seems to have recovered, more or less, thanks!
Re: Indexing performance with Lucene 1.9
Otis Gospodnetic wrote: Regarding performance fix - if you can be more precise (is it really just more or less or is it as good as before), that would be great for those of us itching to use 1.9. Yes, I can confirm that performance differs by no more than 3.1 fraggles. ;-)
Re: Frequency of phrase
Doug Cutting wrote: If you use a span query then you can get the actual number of phrase instances. Thanks, good to know! In this case (need to suggest phrase queries to the user) I've now settled with dividing the number of hits for a potential phrase by the number of documents that contain all terms in the phrase. Seems to be fast and work well...
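The heuristic described above can be sketched as a ratio: the fraction of candidate documents (those containing all the terms) in which the terms actually occur as a phrase. The counts below are hypothetical, just to illustrate the shape of the computation:

```java
public class PhraseScore {
    // Fraction of candidate documents in which the terms actually appear
    // as a phrase; a higher value means the phrase is worth suggesting.
    static double score(int phraseHits, int docsWithAllTerms) {
        return docsWithAllTerms == 0 ? 0.0 : (double) phraseHits / docsWithAllTerms;
    }

    public static void main(String[] args) {
        // hypothetical counts for a phrase like "anopheles anopheles"
        System.out.println(score(950, 1000)); // terms almost always adjacent -> suggest quoting
        System.out.println(score(2, 1000));   // rarely adjacent -> not worth suggesting
    }
}
```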
Re: Frequency of phrase
Dave Kor wrote: Not sure if this is what you want, but what I have done is to issue exact phrase queries to Lucene and counted the number of hits found. This gives you the number of documents containing the phrase, rather than the number of occurrences of the phrase itself, but that may in fact be good enough...
Frequency of phrase
This is somewhat related to a question sent to this list a while ago: Is there an efficient way to count the number of occurrences of a phrase (not term) in an index?
QueryPrinter?
I need to parse a query string, modify it a bit, and then output the modified query string. This works quite well with query.toString(), except that when I parse the query I set DEFAULT_OPERATOR_AND, and the output of BooleanQuery.toString() assumes DEFAULT_OPERATOR_OR... It would be great if this behavior could be changed through a static field, or perhaps someone has already written some kind of QueryPrinter that is a bit more flexible?
Re: Generating phrase queries from term queries
Chris Hostetter wrote: (Assuming *I* understand it) what he's talking about, is the ability for his search GUI to display suggested phrase searches you may want to try which consist of the words you just typed in grouped into phrases. Yes, that's precisely what I am talking about. Sorry for being unclear. Presumably, if multiple phrases in the source data can be found in the permutations of the search words, the least common are the ones you'd want to suggest -- which makes the problem a sort of SIP problem (ie: given an extremely limited set of words, find the statistically improbable phrases in the corpus made using only subsets of those words) I'd already be happy to get *any* phrases :-) If the phrases could be ranked, I might prefer to pick the *most frequent* phrases. For example: "anopheles anopheles malaria" ("anopheles anopheles" is the latin name for the common mosquito) I'd like to be able to suggest quoting this name to eliminate all the other mosquito species that also contain anopheles in their name. There are lots of documents with "anopheles anopheles". There may also be a document or two where anopheles happens to appear next to malaria, but these are less interesting here.
Re: Generating phrase queries from term queries
Paul Elschot wrote: One way that might be better is to provide your own Scorer that works on the term positions of the three or more terms. This would be better for performance because it only uses one term positions object per query term (a, b, and c here). I'm trying to extract the actual phrases, rather than scoring documents with terms that appear in the same order higher (though that would seem like a good idea, too). The idea is that once I have the phrases, I can suggest something like "show only matches where a and b appear next to each other". Not terribly important, but if there was a simple and efficient way to accomplish this...
Re: Scoring by number of terms in field
Paul Elschot wrote: In case you prefer to use the maximum score over the clauses you can use the DisjunctionMaxQuery from the development version. Yes, that may help! I'll need to have a look...
Generating phrase queries from term queries
Is there an efficient way to determine if two or more terms frequently appear next to each other in sequence? For a query like a b c, one or more of the following suggestions could be generated: "a b" c, a "b c", "a b c". I could of course just run a search with all possible combinations, but perhaps there is a better way?
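Enumerating the candidate groupings themselves is straightforward; a sketch that, for a query of n terms, quotes every contiguous run of two or more adjacent terms as a phrase:

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseCandidates {
    // For each contiguous run of two or more terms, emit the query with
    // that run quoted as a phrase and the remaining terms left as-is.
    static List<String> candidates(String[] terms) {
        List<String> result = new ArrayList<String>();
        for (int len = 2; len <= terms.length; len++) {
            for (int start = 0; start + len <= terms.length; start++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < terms.length; i++) {
                    if (i > 0) sb.append(' ');
                    if (i == start) sb.append('"');
                    sb.append(terms[i]);
                    if (i == start + len - 1) sb.append('"');
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String c : candidates(new String[] { "a", "b", "c" })) {
            System.out.println(c);
        }
        // prints:
        // "a b" c
        // a "b c"
        // "a b c"
    }
}
```

The number of candidates grows only quadratically with the query length, so the expensive part remains checking each candidate against the index.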
Scoring by number of terms in field
Lucene seems to prefer matches in shorter documents. Is it possible to influence the scoring mechanism to have matches in shorter fields score higher instead? For example, a query for europe should rank: 1. title:Europe 2. title:History of Europe 3. title:Travel in Europe, Middle East and Africa 4. subtitle:Fairy Tales from Europe
Re: Scoring by number of terms in field
Paul Elschot wrote: For example, a query for europe should rank: 1. title:Europe 2. title:History of Europe 3. title:Travel in Europe, Middle East and Africa 4. subtitle:Fairy Tales from Europe Perhaps with this query (assuming the default implicit OR): title:europe subtitle:europe^0.5 body:europe This will ensure that match 4 appears at the end, but as far as I can see, this won't help with getting matches 1-3 ordered correctly? Note that match 1, for example, may have a description field that contains a lot of terms, but no mention of the query term.
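Note that the ordering wanted for matches 1-3 is what Lucene's default length normalization already produces within a field: the norm is 1/sqrt(number of terms in the field), computed per field, not per document (though it is stored lossily as a single byte, so very close lengths can collapse to the same value). A sketch of that formula applied to the title lengths above:

```java
public class LengthNorm {
    // Lucene's default length normalization: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1)); // "Europe"                                   -> 1.0
        System.out.println(lengthNorm(3)); // "History of Europe"                        -> ~0.577
        System.out.println(lengthNorm(7)); // "Travel in Europe, Middle East and Africa" -> ~0.378
    }
}
```

So if matches 1-3 come out in the wrong order in practice, the culprit is likely other fields (such as a long description field) contributing to the document score, rather than the title field's own normalization.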