Re: SpanQuery for Terms at same position
On Wednesday 25 November 2009 21:20:33, Christopher Tignor wrote: It's worth noting, however, that this -1 slop doesn't seem to work for cases where you want to discover instances of more than two terms at the same position. It would be nice to be able to set this explicitly in the query construction.

I think requiring n terms at the same position would need a slop of 1-n, and I'd like to have some test cases added for that. Now if I only had some time...

Regards,
Paul Elschot

On Tue, Nov 24, 2009, Christopher Tignor wrote: Yes, that indeed works for me. Thanks, CT

On Mon, Nov 23, 2009, Paul Elschot wrote: On Monday 23 November 2009 20:07:58, Christopher Tignor wrote: Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results for normal ordered searches, using searches like "_n followed by work": because _n and work are at the same position, the code changes accept their pairing as a valid in-order result now that the equal-to clause has been added to the inequality.

Thanks for trying this. Indeed, the followed-by semantics are broken for the ordered case when spans at the same positions are considered ordered. Did I understand correctly that the unordered case with a slop of -1 and without the edit works to match terms at the same position? In that case it may be worthwhile to add that to the javadocs, and also to add a few test cases. Regards, Paul Elschot

On Mon, Nov 23, 2009, Christopher Tignor wrote: Thanks so much for this. Using an unordered query, the -1 slop indeed returns the correct results, matching tokens at the same position. I tried the same query but ordered, both after and before rebuilding the source with Paul's changes to NearSpansOrdered, but the query was still failing, returning no results. CT

On Mon, Nov 23, 2009, Mark Miller wrote: You're trying -1 with ordered, right? Try it with non-ordered.

Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned. This would be a *really* helpful feature for me if someone might suggest an implementation, as I would really like to be able to do arbitrary span searches where tokens may be at the same position, and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API. Thanks, CT

On Sun, Nov 22, 2009, Paul Elschot wrote: On Sunday 22 November 2009 04:47:50, Adriano Crestani wrote: Hi, I didn't test, but you might want to try SpanNearQuery and set slop to zero. Give it a try and let me know if it worked.

The slop is the number of positions in between, so zero would still be too much to only match at the same position. SpanNearQuery may or may not work for a slop of -1, but one could try that for both the ordered and unordered cases. One way to do that is to start from the existing test cases. Regards, Paul Elschot

On Thu, Nov 19, 2009, Christopher Tignor wrote: Hello, I would like to search for all documents that contain both "plan" and "_v" (my part-of-speech token for verb) at the same position. I have tokenized the documents accordingly, so these tokens exist at the same location. I can achieve this programmatically using PhraseQueries by adding the Terms explicitly at the same position, but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly because it is converting it into Spans first, which do not support searching for Terms at the same document position? Any help appreciated.

Thanks, CT

--
TH!NKMAP Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
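Paul's "slop of 1-n" can be checked with a little arithmetic. As I read the unordered case (NearSpansUnordered), a set of candidate sub-spans matches when maxEnd - minStart - (total length of the sub-spans) is at most the slop. The sketch below is plain Java with no Lucene dependency; the class and method names are illustrative, not Lucene API:

```java
// A sketch, outside Lucene, of the unordered slop arithmetic: a candidate
// match needs (maxEnd - minStart - totalSpanLength) <= slop. Each sub-span
// is an int[]{start, end} with end exclusive.
public class SameSpanSlop {
    static int matchSlop(int[][] spans) {
        int minStart = Integer.MAX_VALUE, maxEnd = Integer.MIN_VALUE, totalLength = 0;
        for (int[] s : spans) {
            minStart = Math.min(minStart, s[0]);
            maxEnd = Math.max(maxEnd, s[1]);
            totalLength += s[1] - s[0];
        }
        // positions "in between"; goes negative when the spans overlap
        return maxEnd - minStart - totalLength;
    }

    public static void main(String[] args) {
        // two single terms at position 5: slop -1 is enough
        System.out.println(matchSlop(new int[][] {{5, 6}, {5, 6}}));         // -1
        // n terms at the same position need a slop of 1 - n
        System.out.println(matchSlop(new int[][] {{5, 6}, {5, 6}, {5, 6}})); // -2
        // adjacent terms: the usual slop 0
        System.out.println(matchSlop(new int[][] {{5, 6}, {6, 7}}));         // 0
    }
}
```

With n single terms at one position the left side is 1 - n, which is why two terms need slop -1 and three would need -2.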
Re: SpanQuery for Terms at same position
On Monday 23 November 2009 17:27:56, Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned.

I think the problem is in the NearSpansOrdered.docSpansOrdered methods. Could you replace the < by <= in there (4 times) and try again? That will allow spans at the same position to be considered ordered. From a quick reading of the code, both the unordered and ordered cases might work for a slop of -1 with that modification.

Christopher Tignor wrote: This would be a *really* helpful feature for me if someone might suggest an implementation, as I would really like to be able to do arbitrary span searches where tokens may be at the same position, and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API.

My pleasure,
Paul Elschot
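For reference, the predicate under discussion can be modeled in a few lines. This is a simplified sketch of NearSpansOrdered.docSpansOrdered (reduced to ints; not the actual Lucene source), showing why the < to <= edit helps the same-position case but, as reported later in the thread, breaks strict "followed by" ordering:

```java
// A simplified model of NearSpansOrdered.docSpansOrdered (a sketch, not
// the actual Lucene source): span 1 must come before span 2.
public class SpanOrderSketch {
    // original predicate: strictly ordered
    static boolean ordered(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 < end2) : (start1 < start2);
    }

    // with "<" replaced by "<=": equal spans now count as ordered
    static boolean orderedWithEdit(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 <= end2) : (start1 <= start2);
    }

    public static void main(String[] args) {
        // two single terms at the same position, spans [5,6) and [5,6):
        System.out.println(ordered(5, 6, 5, 6));         // false: no match
        System.out.println(orderedWithEdit(5, 6, 5, 6)); // true: same position matches,
        // but a strict "followed by" query now wrongly accepts them too,
        // which is the erroneous in-order result reported in this thread.
    }
}
```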
Re: SpanQuery for Terms at same position
Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor: Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results fo normal ordered searches using searches like: _n followed by work where because _n and work are at the same position the code changes accept their pairing as a valid in-order result now that the eqaul to clause has been added to the inequality. Thanks for trying this. Indeed the followed by semantics is broken for the ordered case when spans at the same positions are considered ordered. Did I understand correctly that the unordered case with a slop of -1 and without the edit works to match terms at the same position? In that case it may be worthwhile to add that to the javadocs, and also add a few testcases. Regards, Paul Elschot CT On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor ctig...@thinkmap.comwrote: Thanks so much for this. Using an un-ordered query, the -1 slop indeed returns the correct results, matching tokens at the same position. I tried the same query but ordered both after and before rebuilding the source with Paul's changes to NearSpansOrdered but the query was still failing, returning no results. CT On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller markrmil...@gmail.comwrote: Your trying -1 with ordered right? Try it with non ordered. Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned. this would be a *really* helpful feature for me if someone might suggest an implementation as I would really like to be able to do arbitrary span searches where tokens may be at the same position and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API. thanks, CT On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot paul.elsc...@xs4all.nl wrote: Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani: Hi, I didn't test, but you might want to try SpanNearQuery and set slop to zero. Give it a try and let me know if it worked. 
The slop is the number of positions in between, so zero would still be too much to only match at the same position. SpanNearQuery may or may not work for a slop of -1, but one could try that for both the ordered and unordered cases. One way to do that is to start from the existing test cases. Regards, Paul Elschot Regards, Adriano Crestani On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor ctig...@thinkmap.comwrote: Hello, I would like to search for all documents that contain both plan and _v (my part of speech token for verb) at the same position. I have tokenized the documents accordingly so these tokens exists at the same location. I can achieve programaticaly using PhraseQueries by adding the Terms explicitly at the same position but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly becuase it is converting it inoto Spans first which do not support searching for Terms at the same document position? Any help appreciated. thanks, CT -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: SpanQuery for Terms at same position
Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani: Hi, I didn't test, but you might want to try SpanNearQuery and set slop to zero. Give it a try and let me know if it worked. The slop is the number of positions in between, so zero would still be too much to only match at the same position. SpanNearQuery may or may not work for a slop of -1, but one could try that for both the ordered and unordered cases. One way to do that is to start from the existing test cases. Regards, Paul Elschot Regards, Adriano Crestani On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor ctig...@thinkmap.comwrote: Hello, I would like to search for all documents that contain both plan and _v (my part of speech token for verb) at the same position. I have tokenized the documents accordingly so these tokens exists at the same location. I can achieve programaticaly using PhraseQueries by adding the Terms explicitly at the same position but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly becuase it is converting it inoto Spans first which do not support searching for Terms at the same document position? Any help appreciated. thanks, CT -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Efficient filtering advice
Try a MultiTermQueryWrapperFilter instead of the QueryFilter; I'd expect a modest gain in performance. In case it is possible to form a few groups of terms that are reused, it could be even more efficient to also use a CachingWrapperFilter for each of these groups.

Regards,
Paul Elschot

On Sunday 22 November 2009 15:48:39, Eran Sevi wrote: Hi, I need to filter my queries using a rather large subset of terms (can be 10K or even 50K). All these terms are sure to exist in the index, so the number of results can be about the same as the number of terms in the filter. The terms are numbers, but they are not consecutive and are drawn from a large set of possible values (so range queries are probably not good for me). The index itself is about 1M docs, and running even a simple query with such a large filter takes a lot of time, even if the number of results is only a few hundred docs.

It seems like the speed is affected by the length of the filter even if the number of results remains more or less the same. That is logical, but not by such a large loss of performance as I'm experiencing: running the query with a 10K-term filter takes an average of 1s 187ms with 600 results, while running it with a 50K-term filter takes an average of 5s 207ms with 1000 results.

Currently I'm using a QueryFilter with a BooleanQuery in which I OR the different terms together. I also can't use a cached filter efficiently, since the terms to filter on change almost every query. I was wondering if there's a better way to filter my queries so they won't take a few seconds to run?

Thanks in advance for any advice,
Eran.
Re: Efficient filtering advice
On Sunday 22 November 2009 17:23:53, Eran Sevi wrote: Thanks for the tips. I'm still using version 2.4, so I can't use MultiTermQueryWrapperFilter, but I'll definitely try to re-group the terms that are not changing in order to cache them. How can I join several such filters together?

There are various ways. OpenBitSet and OpenBitSetDISI can do this, and there are also BooleanFilter and ChainedFilter in contrib.

Eran Sevi wrote: Using FieldCacheTermsFilter sounds promising. Fortunately it is a single-value field (our unique doc id).

Regards,
Paul Elschot

Eran Sevi wrote: I'll consider very seriously moving to 2.9.1 in order to try it out and see if I can get some real gain from using it, or maybe using TermsFilter from contrib.

On Sun, Nov 22, 2009 at 6:10 PM, Uwe Schindler wrote: Maybe this helps you, but read the docs, it will work only with single-value fields:
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/FieldCacheTermsFilter.html

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de eMail: u...@thetaphi.de
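Joining several cached group filters comes down to ANDing their document bit sets. The sketch below models that with java.util.BitSet standing in for Lucene's OpenBitSet (illustrative names, no Lucene dependency): a document survives only if it is in the query's set and in every filter's set.

```java
import java.util.BitSet;

// A sketch of joining filters, assuming each filter has already been
// materialized as a bit set over docIds (java.util.BitSet stands in for
// Lucene's OpenBitSet here; the names are illustrative).
public class FilterJoin {
    // AND the cached group filters into the query's doc set.
    static BitSet intersect(BitSet queryDocs, BitSet... filters) {
        BitSet result = (BitSet) queryDocs.clone();
        for (BitSet f : filters) {
            result.and(f); // doc survives only if every filter accepts it
        }
        return result;
    }

    // helper to build a bit set from docIds
    static BitSet bits(int... docs) {
        BitSet b = new BitSet();
        for (int d : docs) b.set(d);
        return b;
    }

    public static void main(String[] args) {
        BitSet query  = bits(1, 2, 5, 8);
        BitSet groupA = bits(2, 5, 8, 9);
        BitSet groupB = bits(0, 2, 8);
        System.out.println(intersect(query, groupA, groupB)); // {2, 8}
    }
}
```

A CachingWrapperFilter around each stable group would let the per-group bit sets be reused across queries, so only the intersection is paid per query.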
Re: Proposal for changing Lucene's backwards-compatibility policy
On Friday 16 October 2009 08:57:37, Michael Busch wrote: Hello Lucene users,

In the past we have discussed our backwards-compatibility policy frequently on the Lucene developer mailing list, and we are thinking about making some significant changes. In this mail I'd like to outline the proposed changes to get some feedback from the user community.

Our current backwards-compatibility policy regarding API changes states that we can only make changes that break backwards compatibility in major releases (3.0, 4.0, etc.); the next major release is the upcoming 3.0. Given how infrequently we have made major releases of Lucene in the past, this means that deprecated APIs need to stay in Lucene for a very long time. E.g. if we deprecate an API in 3.1, we'll have to wait until 4.0 before we can remove it. This means that the code gets very cluttered, and adding new features gets somewhat more difficult, as attention has to be paid to properly supporting the old *and* new APIs for quite a long time. The current policy also leads to delaying the last minor release before a major release (e.g. 2.9), because the developers consider it the last chance for a long time to introduce new APIs and deprecate old ones.

The proposal now is to change this policy so that an API can only be removed if it was deprecated in at least one release, which can be a major *or* minor release. E.g. if we deprecate an API and release it with 3.1, we can remove it with the 3.2 release. The obvious downside of this proposal is that a simple jar drop-in replacement will not be possible anymore with almost every Lucene release (excluding bugfix releases, e.g. 2.9.0 to 2.9.1). However, you can be sure that if you're using a non-deprecated API, it will be in the next release.

Note that of course these proposed changes do not affect backwards compatibility with old index formats; i.e. it will still be possible to read all 3.X indexes with any Lucene 4.X version.
Our main goal is to find the right balance between backwards-compatibility support for all the Lucene users out there and fast, productive development of new features. The developers haven't come to an agreement on this proposal yet. Potentially giving up the drop-in replacement promise that Lucene could make in the past is the main reason for the struggle the developers are in, and why we'd like to ask the user community for feedback to help us make a decision. After we have gathered some feedback here, we will call a vote on the development mailing list, where the committers have to officially decide whether to make these changes or not.

So please tell us which you prefer as a back-compatibility policy for Lucene:

A) best-effort drop-in back compatibility for minor version numbers (e.g. v3.5 will be compatible with v3.2)

B) best-effort drop-in back compatibility for the next minor version number only, where deprecations may be removed after one minor release (e.g. v3.3 will be compatible with v3.2, but not v3.4)

I'd prefer B), with a minimum period of about two months to the next release in case it removes deprecations.

Regards,
Paul Elschot
Re: faceted search performance
On Monday 12 October 2009 23:29:07, Christoph Boosz wrote: Hi Paul, thanks for your suggestion. I will test it within the next few days. However, due to memory limitations, it will only work if the number of hits is small enough, am I right?

One can load a single term vector at a time, so in this case the memory limitation is only in the possibly large map of doc counters per term. For best performance, try to load the term vectors in docId order, after the original query has completed. In any case it would be good to somehow limit the number of documents considered, for example by using the ones with the best query score. Limiting the number of terms would also be good, but that is less easy.

Regards,
Paul Elschot

2009/10/12 Paul Elschot: Chris, you could also store term vectors for all docs at indexing time, and add the term vectors for the matching docs into a (large) map of terms in RAM.

On Monday 12 October 2009 21:30:48, Christoph Boosz wrote: Hi Jake, thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the decline in performance. Now I know it's nothing abnormal, at least. Chris

2009/10/12 Jake Mannix: Hey Chris,

On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz wrote: Thanks for your reply. Yes, it's likely that many terms occur in few documents. If I understand you right, I should do the following:
- Write a HitCollector that simply increments a counter
- Get the filter for the user query once: new CachingWrapperFilter(new QueryWrapperFilter(userQuery));
- Create a TermQuery for each term
- Perform the search and read the counter of the HitCollector
I did that, but it didn't get faster. Any ideas why?

The killer is the "TermQuery for each term" part; this is huge. You need to invert this process and use your query as is, but while walking in the HitCollector, on each doc which matches your query, increment counters for each of the terms in that document. This means you need an in-memory forward lookup for your documents, like a multivalued FieldCache, and if you've got roughly the same number of terms as documents, this cache is likely to be as large as your entire index: a pretty hefty RAM cost.

But a good thing to keep in mind is that doing this kind of faceting (massively multivalued on a huge term set) requires a lot of computation, even if you have all the proper structures living in memory: for each document you look at (which matches your query), you need to look at all of the terms in that document and increment a counter for each term. So however much time it would normally take you to do the driving query, it can take as much as that multiplied by the average number of terms per document in your index. If your documents are big, this could be a pretty huge latency penalty.

-jake
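Paul's term-vector suggestion can be sketched without Lucene: after the driving query has produced its matching docIds, walk them in docId order, load each document's term vector (modeled here as a String[] per docId), and add the terms into one map of counters. The names are illustrative, not Lucene API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// A sketch of facet counting from stored term vectors: one pass over the
// matching docs, one shared map of counters (the "large map of terms in
// RAM" mentioned above).
public class TermVectorFacets {
    static Map<String, Integer> countTerms(int[] matchingDocs,
                                           Map<Integer, String[]> termVectors) {
        int[] docs = matchingDocs.clone();
        Arrays.sort(docs); // docId order keeps term-vector reads sequential
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int doc : docs) {
            for (String term : termVectors.get(doc)) {
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, String[]> tv = new HashMap<Integer, String[]>();
        tv.put(0, new String[] {"plan", "work"});
        tv.put(1, new String[] {"plan"});
        tv.put(2, new String[] {"work"});
        // suppose only docs 1 and 2 matched the driving query
        System.out.println(countTerms(new int[] {2, 1}, tv).get("plan")); // 1
    }
}
```

Limiting the matching docs (e.g. to the best-scoring ones, as suggested above) directly bounds both the work per query and the size of the counter map.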
Re: faceted search performance
On Monday 12 October 2009 14:53:45, Christoph Boosz wrote: Hi, I have a question related to faceted search. My index contains more than 1 million documents and nearly 1 million terms. My aim is to get a DocIdSet for each term occurring in the result of a query. I use the approach described at http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html, where a BitSet is built out of a QueryFilter for each term and intersected with the BitSet representing the user query. However, performance could be better. I guess it's because the term filter considers each document in the index, even if it's not in the result. My attempt to use a ChainedFilter, where the first filter (cached) is for the user query and the second one for the term (done for all terms), didn't speed things up, though. Am I missing something? Is there a better way to get the DocIdSets for a huge number of terms in a limited set of documents?

Assuming you only need the number of documents within the original query that contain each term, one thing that can be saved is the allocation of the resulting BitSet for each term. To do this, use the cached BitSet (or the OpenBitSet in current Lucene) for the original Query as a filter for a TermQuery per term, and then count the matching documents by using a counting HitCollector on the IndexSearcher.

Regards,
Paul Elschot
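The counting idea above can be sketched in plain Java: keep the original query's cached bit set, and for each term count how many of that term's documents fall inside it, without allocating a result BitSet per term. java.util.BitSet stands in for Lucene's (Open)BitSet, and the increment is what a counting HitCollector would do per collected doc; names are illustrative.

```java
import java.util.BitSet;

// A sketch of per-term counting within a cached query filter: no per-term
// result set is allocated, only a counter.
public class CountingCollectorSketch {
    static int countWithinQuery(BitSet queryDocs, int[] termDocs) {
        int count = 0;
        for (int doc : termDocs) {       // docs where the term occurs
            if (queryDocs.get(doc)) {    // doc also matches the original query
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet query = new BitSet();
        query.set(1);
        query.set(4);
        query.set(7);
        System.out.println(countWithinQuery(query, new int[] {0, 4, 7, 9})); // 2
    }
}
```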
Re: faceted search performance
Chris,

You could also store term vectors for all docs at indexing time, and add the term vectors for the matching docs into a (large) map of terms in RAM.

Regards,
Paul Elschot

On Monday 12 October 2009 21:30:48, Christoph Boosz wrote: Hi Jake, thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the decline in performance. Now I know it's nothing abnormal, at least. Chris
Re: speed of BooleanQueries on 2.9
On Wednesday 15 July 2009 17:16:23, Michael McCandless wrote: So now I'm confused. Since your query has required (+) clauses, the setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.

Probably the top-level BQ is using BS2 because of the required clauses, but the nested BQs are using BS because the docs are allowed out of order. In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation could well be the culprit for performance. A long time ago BS.skipTo() used to throw an unsupported operation exception, but that does not seem to be happening. Eks, could you try a toString() on the top-level scorer for one of the affected queries, to see whether it shows BS2 on the top level and BS for the inner scorers?

Regards,
Paul Elschot

BooleanQuery only uses BooleanScorer when there are no required terms and allowDocsOutOfOrder is true. So I can't explain why you see this setting changing anything on this query...

Mike

On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote: I do not know exactly why, but with BooleanQuery.setAllowDocsOutOfOrder(true) I have the problem, and with setAllowDocsOutOfOrder(false) no problems whatsoever. Not really a scientific method to find such a bug, but it does the job and makes me happy. Empirically, deprecated methods are not to be taken as thoroughly tested, as they have a short life expectancy.

- Original Message -
From: eks dev
To: java-user@lucene.apache.org
Sent: Wednesday, 15 July, 2009 0:24:43
Subject: Re: speed of BooleanQueries on 2.9

Mike, we are definitely hitting something with this one! We had a report from our QA chaps that our servers got stuck (the limit is 180 seconds per request)... We average 14 requests per second. It has nothing to do with gc(), as we can repeat it with a freshly restarted searcher.

It happens on less than 0.1% of queries, not much of a pattern, repeatable on our index... It is always a combination of two expanded tokens (we use minimumNumberShouldMatch):

(+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))

All tokens have a boost set, and minNumShouldMatch is set to two. I cannot provide a self-contained test, nor the index (it contains sensitive data and is rather big, ~5G). I can repeat this test on t1 and t2 with 40 expansions each. Even if I take the most frequent tokens in the collection, it runs well under one second... but these two particular tokens with their expansions make it run forever. And yes, if I run t1 plus expansions only, it runs super fast; the same for t2.

Java 1.4U14, tried with 1.6U6, no changes... Will report if I dig something out.

Partial stack trace while stuck, cpu is at max:
org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown Source)
org.apache.lucene.search.BooleanScorer.score(Unknown Source)
org.apache.lucene.search.BooleanScorer.score(Unknown Source)
org.apache.lucene.search.IndexSearcher.search(Unknown Source)
org.apache.lucene.search.IndexSearcher.search(Unknown Source)
org.apache.lucene.search.Searcher.search(Unknown Source)

- Original Message -
From: eks dev
To: java-user@lucene.apache.org
Sent: Monday, 13 July, 2009 13:28:45
Subject: Re: speed of BooleanQueries on 2.9

Hi Mike, getMaxNumOfCandidates() in the test was 200; the index is optimised and read-only. We found (due to an error in our warm-up code, funny) that only this query runs slower on 2.9. A hint where to look could be that this query contains the two most frequent tokens in two particular fields, NAME:hans and ZIPS:berlin (the index has ca. 80M very short documents, 3M unique terms). But all of this *could be just wrong measurement*; I just could not spend more time to get to the bottom of this. We moved forward as we got better overall average performance (a sweet 10% on average) on a much bigger real query log from our regression test. Anyhow, I just wanted to throw it out, maybe it triggers some synapses :) If false alarm, sorry.

- Original Message -
From: Michael McCandless
To: java-user@lucene.apache.org
Sent: Monday, 13 July, 2009 11:50:48
Subject: Re: speed of BooleanQueries on 2.9

This is not expected; 2.9 has had a number of changes that ought to reduce the CPU cost of searching. If this holds up, we definitely need to get to the root cause. Did your test exclude the warmup query for both 2.4.1 and 2.9? How many segments are in the index? What is the actual value of getMaxNumOfCandidates()? If you simplify the query down (e.g. just do the NAME clause or the ZIPS clause alone), are those also 4X slower?

Mike

On Sun, Jul 12, 2009 at 12:53 PM, eks dev wrote: Is it possible
Re: speed of BooleanQueries on 2.9
As long as next(), skipTo(), doc() and score() on a Scorer work, the search will be done. I hope the results are correct in this case, but I'm not sure. Regards, Paul Elschot On Wednesday 15 July 2009 19:08:00 Michael McCandless wrote: I don't think a toplevel BS2 is able to use BS as sub-scorers? BS2 needs to do doc-at-once, for all sub-scorers, but BS can't do that. I think? Mike On Wed, Jul 15, 2009 at 12:10 PM, Paul Elschotpaul.elsc...@xs4all.nl wrote: On Wednesday 15 July 2009 17:16:23 Michael McCandless wrote: So now I'm confused. Since your query has required (+) clauses, the setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk. Probably the top level BQ is using BS2 because of the required clauses, but the nested BQ's are using BS because the docs are allowed out of order. In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation could well be the culprit for performance. A long time ago BS.skipTo() used to throw an unsupported operation exception, but that does not seem to be happening. Eks, could you try a toString() on the top level scorer for one of the affected queries to see whether it shows BS2 on top level and BS for the inner scorers? Regards, Paul Elschot BooleanQuery only uses BooleanScorer when there are no required terms, and allowDocsOutOfOrder is true. So I can't explain why you see this setting changing anything on this query... Mike On Tue, Jul 14, 2009 at 7:04 PM, eks deveks...@yahoo.co.uk wrote: I do not know exactly why, but when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but with setAllowDocsOutOfOrder(false); no problems whatsoever not really scientific method to find such bug, but does the job and makes me happy. 
Empirical, deprecated methods are not to be taken as thoroughly tested, as they have a short life expectancy. - Original Message From: eks dev eks...@yahoo.co.uk To: java-user@lucene.apache.org Sent: Wednesday, 15 July, 2009 0:24:43 Subject: Re: speed of BooleanQueries on 2.9 Mike, we are definitely hitting something with this one! We had a report from our QA chaps that our servers got stuck (the limit is 180 seconds per request)... We average 14 requests per second. It has nothing to do with gc() as we can repeat it with a freshly restarted searcher. - it happens on less than 0.1% of queries, not much of a pattern, repeatable on our index... it is always a combination of two expanded tokens (we use minimumNumberShouldMatch)... (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2])) all tokens have a boost set, and minNumShouldMatch is set to two. I cannot provide a self-contained test, nor the index (it contains sensitive data and is rather big, ~5G). I can repeat this test on t1 and t2 with 40 expansions each. Even if I take the most frequent tokens in the collection it runs well under one second... but these two particular tokens with their expansions make it run forever... and yes, if I run t1 plus expansions only, it runs super fast, the same for t2. Java 1.4U14, tried with 1.6U6, no changes...
Will report if I dig something out. Partial stack trace while stuck, cpu at max: org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown Source) org.apache.lucene.search.BooleanScorer.score(Unknown Source) org.apache.lucene.search.BooleanScorer.score(Unknown Source) org.apache.lucene.search.IndexSearcher.search(Unknown Source) org.apache.lucene.search.IndexSearcher.search(Unknown Source) org.apache.lucene.search.Searcher.search(Unknown Source) - Original Message From: eks dev To: java-user@lucene.apache.org Sent: Monday, 13 July, 2009 13:28:45 Subject: Re: speed of BooleanQueries on 2.9 Hi Mike, getMaxNumOfCandidates() in the test was 200, the index is optimised and read-only. We found (due to an error in our warm-up code, funny) that only this query runs slower on 2.9. A hint where to look could be that this query contains the two most frequent tokens in two particular fields, NAME:hans and ZIPS:berlin (the index has ca. 80 million very short documents, 3 million unique terms). But all of this *could be just wrong measurement*, I just could not spend more time to get to the bottom of this. We moved forward as we got overall better average performance (a sweet 10% on average) on a much bigger real query log from our regression test. Anyhow, I just wanted to throw it out, maybe it triggers some synapses :) If it's a false alarm, sorry. - Original Message From: Michael McCandless To: java-user@lucene.apache.org
Re: Boolean retrieval
It is also possible to use the HitCollector api and simply ignore the score values. Regards, Paul Elschot On Saturday 04 July 2009 21:14:41 Mark Harwood wrote: Check out BooleanFilter in contrib/queries. It can be wrapped in a ConstantScoreQuery. On 4 Jul 2009, at 17:37, Lukas Michelbacher miche...@ims.uni-stuttgart.de wrote: This is about an experiment comparing plain Boolean retrieval with vector-space-based retrieval. I would like to disable all of Lucene's scoring mechanisms and just run a true Boolean query that returns exactly the documents that match a query specified in Boolean syntax (OR, AND, NOT). No scoring or sorting required. As far as I can see, this is not supported out of the box. Which classes would I have to modify? Would it be enough to create a subclass of Similarity, ignore all terms but one (coord, say) and make this term return 1 if the query matches the document and 0 otherwise? Lukas -- Lukas Michelbacher Institute for Natural Language Processing Universität Stuttgart email: miche...@ims.uni-stuttgart.de
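The pure Boolean retrieval Lukas asks about — keep the Boolean matching, drop the scoring — can be illustrated with a toy model: plain set operations over term-to-doc-id postings. This is a sketch of the semantics only, not the Lucene API; in Lucene itself the closest equivalents are the contrib BooleanFilter wrapped in a ConstantScoreQuery, or a collector that ignores the score argument.

```java
import java.util.*;

// Toy model of pure Boolean retrieval (no scoring): each term maps to the
// set of documents containing it; OR/AND/NOT are plain set operations.
public class BooleanRetrieval {
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.addAll(b); return r;
    }
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.retainAll(b); return r;
    }
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.removeAll(b); return r;
    }
    public static void main(String[] args) {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("lucene", new TreeSet<>(Arrays.asList(0, 1, 3)));
        index.put("scoring", new TreeSet<>(Arrays.asList(1, 2)));
        index.put("boolean", new TreeSet<>(Arrays.asList(0, 2, 3)));
        // (lucene OR scoring) AND boolean NOT scoring
        Set<Integer> hits = not(and(or(index.get("lucene"), index.get("scoring")),
                                    index.get("boolean")), index.get("scoring"));
        System.out.println(hits);  // [0, 3]
    }
}
```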
Re: Need help : SpanNearQuery
To avoid passing all combinations to a NearSpansQuery some non-trivial changes would be needed in the spans package. NearSpansUnordered (and maybe also NearSpansOrdered) would have to be extended to provide matching Spans when not all terms/subqueries (i.e. their Spans) match. Also, quite likely, it will be necessary to add a float getWeight() method to the Spans interface. This value could indicate how many terms/subqueries actually matched, and then be used in SpanScorer to provide a score for the matching document. This weight value would also be useful in other cases, for example to allow different weights in SpanTermQuery. Regards, Paul Elschot On Friday 17 April 2009 12:18:46 Radhalakshmi Sreedharan wrote: To make the question simple, what I need is the following: if my document field is (ab,bc,cd,ef) and the search tokens are (ab,bc,cd), then: I should get a hit even if not all of the search tokens are present. If the tokens are found, they should be found within a distance x of each other (proximity search). I need the percentage match of the search tokens with the document field. Currently this is my query: 1) I form all possible permutations of the search tokens 2) I do a SpanNearQuery for each permutation 3) I do a DisjunctionMaxQuery on the SpanNearQueries. This is how I compute % match: % match = (score from running the query on the document field) / (score from running the query on a document field created out of the search tokens). The numerator gives me the actual score with the search tokens run on the field. The denominator gives me the best possible or maximum possible score with the current search tokens. For this example, if my document field is (ab,bc,cd,ef) and the search tokens are (ab,bc,cd), I expect a % match of around 90%. However I get a match of only around 50% without a boost. Using a boost in fact reduces my percentage. I even overrode the queryNorm method to return one, and still the percentage did not increase. Any suggestions?
-Original Message- From: Radhalakshmi Sreedharan [mailto:radhalakshm...@infosys.com] Sent: Friday, April 17, 2009 12:37 PM To: java-user@lucene.apache.org Subject: RE: Need help : SpanNearQuery Hi Steven, Thanks for your reply. I tried out your approach and the problem got solved to an extent, but still it remains. The problem is that the score still goes down quite a bit, as bc is not found in the combinations (bc,cd), (bc,ef), (ab,bc,cd,ef), etc. The boosting in fact has a negative impact and reduces the score further :( The factor affected by boosting is the queryNorm. With a boost of 6 - 0.015559823 = (MATCH) max of: 0.015559823 = (MATCH) weight(spanNear([SearchField:cd, SearchField:ef], 10, false)^6.0 in 0), product of: 0.07606166 = queryWeight(spanNear([SearchField:cd, SearchField:ef], 10, false)^6.0), product of: 6.0 = boost 0.61370564 = idf(SearchField: cd=1 ef=1) 0.02065639 = queryNorm 0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10, false)^6.0 in 0), product of: 0.3334 = tf(phraseFreq=0.3334) 0.61370564 = idf(SearchField: cd=1 ef=1) 1.0 = fieldNorm(field=SearchField, doc=0) Without a boost - 0.07779912 = (MATCH) max of: 0.07779912 = (MATCH) weight(spanNear([SearchField:cd, SearchField:ef], 10, false) in 0), product of: 0.3803083 = queryWeight(spanNear([SearchField:cd, SearchField:ef], 10, false)), product of: 0.61370564 = idf(SearchField: cd=1 ef=1) 0.6196917 = queryNorm 0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10, false) in 0), product of: 0.3334 = tf(phraseFreq=0.3334) 0.61370564 = idf(SearchField: cd=1 ef=1) 1.0 = fieldNorm(field=SearchField, doc=0) Regards, Radha -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Thursday, April 16, 2009 10:35 PM To: java-user@lucene.apache.org Subject: RE: Need help : SpanNearQuery Hi Radha, On 4/16/2009 at 8:35 AM, Radhalakshmi Sreedharan wrote: I have a question related to SpanNearQuery.
I need a hit even if only 2 of 3 terms are found, with the span being applied to those 2 terms. Is there any custom implementation in place for this? I checked SrndQuery but that also doesn't work. This is my workaround currently: 1) For a list of terms (ab,bc,cd,ef), make a set of combinations like (ab,bc), (bc,cd), (ab,cd), (bc,ef), (ab,bc,cd), (ab,bc,cd,ef), and so on. 2) Create a SpanNearQuery for each of these combinations. 3) Add them to a BooleanQuery with SHOULD clauses. However this approach gives me puzzling scores, e.g. if my document has only (ab,bc,cd) the penalty for the missing ef is very high and my score comes down quite a bit. Do you know about the scoring documentation on the Lucene site: http
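Step 1 of the workaround above — enumerating the token combinations that each become one SpanNearQuery SHOULD clause — amounts to generating every order-preserving subset of at least two tokens. A minimal sketch with illustrative names (the subsets would then be fed to the query construction in steps 2 and 3):

```java
import java.util.*;

// Enumerate every order-preserving subset of at least minSize tokens,
// using a bitmask over the token positions.
public class TokenSubsets {
    static List<List<String>> subsets(List<String> tokens, int minSize) {
        List<List<String>> result = new ArrayList<>();
        int n = tokens.size();
        for (int mask = 1; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) < minSize) continue;
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) subset.add(tokens.get(i));
            result.add(subset);
        }
        return result;
    }
    public static void main(String[] args) {
        System.out.println(subsets(Arrays.asList("ab", "bc", "cd"), 2));
        // [[ab, bc], [ab, cd], [bc, cd], [ab, bc, cd]]
    }
}
```

Note the combinatorial blow-up: four tokens already yield 11 subsets of size two or more, which is part of why the resulting disjunction scores so oddly.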
Re: Need help : SpanNearQuery
On Friday 17 April 2009 16:33:27 Radhalakshmi Sreedharan wrote: Thanks Paul. Is there any alternative way of implementing this requirement? Start from scratch perhaps? Anyway, spans can be really tricky, so in case you're writing code for this, I have only four pieces of advice: test, test, test and test. As a side note, will the ShingleFilter help me get all possible combinations of the input tokens? I don't know. Regards, Paul Elschot
Re: Index in text format
On Thursday 09 April 2009 21:56:44 Andy wrote: Is there a way to have Lucene write its index in a text file? No. You could try a hexdump of the index file(s), but that isn't really human readable. Instead of that you may want to try Luke: http://www.getopt.org/luke/ Regards, Paul Elschot
Re: Internals question: BooleanQuery with many TermQuery children
On Tuesday 07 April 2009 05:04:44 Daniel Noll wrote: Hi all. This is something I have been wondering for a while but can't find a good answer to by reading the code myself. If you have a query like this: (field:Value1 OR field:Value2 OR field:Value3 OR ...) how many TermEnum / TermDocs scans should this execute? (a) One per clause, or (b) one for the entire boolean query? One per clause. I wonder because we use a lot of queries of this nature, and I can't find any direct evidence that they get logically merged, leading me to believe that it's one per clause at present (and thus this becomes a potential optimisation.) The problem is not only in the scanning of the TermDocs, but also in the merging by docId (on a heap) that has to take place when more of them are used at the same time during the query search. Some optimisations are already in place: - By allowing docs to be scored out of order, most top-level OR queries can be merged with a faster algorithm (a distributive sort over docId ranges) using the term frequencies (see BooleanQuery.setAllowDocsOutOfOrder()). - Various Filters merge into a bitset, using a single TermDocs and ignoring term frequencies (see MultiTermQuery.getFilter()). - The new TrieRangeFilter premerges ranges at indexing time, also ignoring term frequencies. Using the TermDocs one by one has another advantage in that it reduces disk seek distances in the index. This is noticeable when disks have heads that take more time to move longer distances. SSDs don't have moving heads, so they have smaller performance differences between merging into a bitset, by distributive sort, and by a heap. For the time being, Lucene does not have a low-level facility for key values that occur at most once per document field, so for these it normally helps to use a Filter. Regards, Paul Elschot
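The heap-based merging by docId that Paul describes can be shown in miniature: each clause contributes a sorted doc-id list, and the disjunction repeatedly visits the smallest current doc id. This is an illustration of the algorithm only, not the actual Lucene scorer code:

```java
import java.util.*;

// Union of several sorted doc-id lists via a min-heap, the way an OR over
// many clauses merges its TermDocs streams. Heap entries are
// {docId, listIndex, positionInList}.
public class HeapUnion {
    static List<Integer> union(List<int[]> postings) {
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> x[0] - y[0]);
        for (int i = 0; i < postings.size(); i++)
            if (postings.get(i).length > 0)
                heap.add(new int[]{postings.get(i)[0], i, 0});
        List<Integer> docs = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            // emit each doc id once, even when several clauses match it
            if (docs.isEmpty() || docs.get(docs.size() - 1) != top[0]) docs.add(top[0]);
            int[] list = postings.get(top[1]);
            int next = top[2] + 1;
            if (next < list.length) heap.add(new int[]{list[next], top[1], next});
        }
        return docs;
    }
    public static void main(String[] args) {
        System.out.println(union(Arrays.asList(
            new int[]{1, 4, 7}, new int[]{2, 4}, new int[]{7, 9})));
        // [1, 2, 4, 7, 9]
    }
}
```

Every emitted doc costs a heap operation per matching clause, which is why the out-of-order distributive sort mentioned above can beat it for wide OR queries.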
Re: Using SpanNearQuery.getSpans() in a Search Result
On Thursday 02 April 2009 15:36:44 David Seltzer wrote: Hi all, I'm trying to figure out how to use SpanNearQuery.getSpans(IndexReader) when working with a result set from a query. Maybe I have a fundamental misunderstanding of what an IndexReader is - I'm under the impression that it's a mechanism for sequentially accessing the documents in an index. So I'm not really sure how that helps me find the spans inside a search result. My problem is compounded by the fact that I'm using ParallelMultiSearcher so I'm not even 100% sure that I know what index each Hit is located in. It's the other way around: for span queries a search result is created (internally, by SpanScorer) from the spans resulting from the getSpans() method above. Does that help? Regards, Paul Elschot All of the examples I find (in LIA and from CNLP) demonstrate on an in-memory index created for the sake of the example. Can anyone give me any guidance on this? Thanks! -Dave
Re: number of hits of pages containing two terms
You may want to try Filters (starting from TermFilter) for this, especially those based on the default OpenBitSet (see the intersection count method) because of your interest in stop words. 10k OpenBitSets for 39M docs will probably not fit in memory in one go, but that can be worked around by keeping fewer of them in memory. For non stop words, you could also try using SortedVIntList instead of OpenBitSet to reduce memory usage. In that case there is no direct intersection count, but a counting iteration over the intersection can still be done without actually forming the resulting filter. Regards, Paul Elschot On Tuesday 17 March 2009 12:35:19 Adrian Dimulescu wrote: Ian Lea wrote: Adrian - have you looked any further into why your original two-term query was too slow? My experience is that simple queries are usually extremely fast. Let me first point out that it is not too slow in absolute terms, it is only slow for my particular need of computing the number of co-occurrences between ideally all non-noise terms (I plan about 10k x 10k = 100 million calculations). How large is the index? I indexed Wikipedia (the 8GB XML dump you can download). The index size is 4.4 GB. I have 39 million documents. The particularity is that I cut Wikipedia into paragraphs and I consider each paragraph as a Document (not one page per Document as usual), which makes for a lot of short documents. Each document has a stored id and a non-stored analyzed body: doc.add(new Field(id, id, Store.YES, Index.NO)); doc.add(new Field(text, p, Store.NO, Index.ANALYZED)); How many occurrences of your first or second terms? I do have in my index some words that are usually qualified as stop words. My first two terms are and (13M hits) and s (4M hits). I use the SnowballAnalyzer in order to lemmatize words. My intuition is that the large number of short documents and the fact that I am interested in the stop words do not help performance. Thank you, Adrian.
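The two counting strategies Paul suggests can be sketched with java.util.BitSet standing in for Lucene's OpenBitSet (whose intersectionCount avoids materializing the result set) and a sorted int array standing in for SortedVIntList (where a counting iteration over the intersection suffices). Illustrative sketch only, not the Lucene classes:

```java
import java.util.BitSet;

public class CoOccurrenceCount {
    // Bitset route: count docs containing both terms; BitSet has no direct
    // intersectionCount, so we clone-and-and (OpenBitSet counts in place).
    static int intersectionCount(BitSet a, BitSet b) {
        BitSet tmp = (BitSet) a.clone();
        tmp.and(b);
        return tmp.cardinality();
    }
    // Sorted-list route: two-pointer counting iteration, nothing materialized.
    static int intersectionCount(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { count++; i++; j++; }
        }
        return count;
    }
    public static void main(String[] args) {
        BitSet x = new BitSet(); x.set(1); x.set(5); x.set(9);
        BitSet y = new BitSet(); y.set(5); y.set(9); y.set(12);
        System.out.println(intersectionCount(x, y));                 // 2
        System.out.println(intersectionCount(new int[]{1, 5, 9},
                                             new int[]{5, 9, 12}));  // 2
    }
}
```

The bitset route wins for dense terms like stop words; the sorted-list route uses memory proportional to the hit count, which is why it suits the non-stop-word terms.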
Re: Speeding up RangeQueries?
On Saturday 14 March 2009 13:38:16 Niels Ott wrote: Hi all, I'm working on my prototype system and it turns out that RangeQueries are quite slow. In a first test I have about 80,000 documents in my index and I combine two range queries with a normal text query using a BooleanQuery. In the long run I will need to enhance my index at indexing time so that the range queries are substituted by simple keywords. Perhaps that is avoidable, see the reference below. For now, I'm interested in a possibility to speed up range queries. Does the performance of a range query depend on the length of the contents of the field in question? Performance mostly depends on the number of terms indexed within the queried range. To limit the number of terms used during a range search, have a look here for more info on the new TrieRangeQuery: http://wiki.apache.org/lucene-java/SearchNumericalFields Regards, Paul Elschot
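The cost driver Paul points at — a classic range query effectively touches one term per indexed value inside the range — can be made concrete by counting the terms a range covers in a sorted term dictionary. A hypothetical sketch (the upper bound is treated as exclusive here, and the names are illustrative):

```java
import java.util.Arrays;

// Count how many dictionary terms fall inside [lo, hi): this number, not the
// field length, is what drives classic range-query cost. TrieRangeQuery's
// trick is to precompute coarser terms so far fewer of them span the range.
public class RangeTermCount {
    static int termsInRange(String[] sortedTerms, String lo, String hi) {
        return lowerBound(sortedTerms, hi) - lowerBound(sortedTerms, lo);
    }
    // index of the first term >= key
    static int lowerBound(String[] a, String key) {
        int i = Arrays.binarySearch(a, key);
        return i >= 0 ? i : -i - 1;
    }
    public static void main(String[] args) {
        String[] terms = {"001", "005", "010", "042", "100"};
        System.out.println(termsInRange(terms, "005", "100"));  // 3
    }
}
```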
Re: Faceted search with OpenBitSet/SortedVIntList
On Tuesday 17 February 2009 10:12:12 Raffaella Ventaglio wrote: Thanks for sharing this info. In any case, this is not a problem for me since I have only used the idea of choosing between OpenBitSet and SortedVIntList from the contrib BooleanFilter; I have implemented it in my own facets manager structure, so I do not use the removed finalResult method. It would be possible to build a similar choice criterion between OpenBitSet and SortedVIntList into CachingWrapperFilter to choose the data structure to be used for caching. For example, when using the same criterion as in the removed methods there, your original problem might not have occurred at all. In the CachingWrapperFilter in trunk the choice is left to an overridable method. Regards, Paul Elschot Regards, Raf On Sun, Feb 15, 2009 at 2:39 PM, Paul Elschot paul.elsc...@xs4all.nl wrote: Meanwhile the choice between SortedVIntList and OpenBitSet has been removed from the trunk (development version), which now uses OpenBitSet only: https://issues.apache.org/jira/browse/LUCENE-1296 In case there is a preference to have SortedVIntList used in the next Lucene version (i.e. in cases when it is smaller than OpenBitSet), please comment at LUCENE-1296. Regards, Paul Elschot
Re: Faceted search with OpenBitSet/SortedVIntList
Meanwhile the choice between SortedVIntList and OpenBitSet has been removed from the trunk (development version), which now uses OpenBitSet only: https://issues.apache.org/jira/browse/LUCENE-1296 In case there is a preference to have SortedVIntList used in the next Lucene version (i.e. in cases when it is smaller than OpenBitSet), please comment at LUCENE-1296. Regards, Paul Elschot On Sunday 08 February 2009 09:47:24 Raffaella Ventaglio wrote: Hi Paul, One way to implement that would be to use one of the boolean combination filters in contrib, BooleanFilter or ChainedFilter, and simply count the number of times next() returns true on the result. I am sorry, but I cannot understand: how can I create a BooleanFilter or a ChainedFilter starting from two SortedVIntList objects? I have not found any filter that takes an existing DocIdSet in its constructor... However I have seen that the Filter interface is very easy to implement. Should I create a custom Filter that wraps my SortedVIntList and then use these filters to create a BooleanFilter? Thanks, Raf
Re: Faceted search with OpenBitSet/SortedVIntList
John, On Sunday 08 February 2009 00:35:10 John Wang wrote: Our implementation of facet search can handle this. Using bitsets for intersection is not scalable performance-wise when the index is large. We are using a compact forward index representation in memory for the counting. Could you describe how this compact forward index works? Similar to the FieldCache idea but more compact. Does this also use FieldCacheRangeFilter and/or FieldCacheTermsFilter? Regards, Paul Elschot
Re: Faceted search with OpenBitSet/SortedVIntList
On Sunday 08 February 2009 09:53:00 Uwe Schindler wrote: I would do so, it's really simple, you can even do it in an anonymous inner class. It is indeed simple, but it might also help to take a look at the source code of the Lucene classes involved. Regards, Paul Elschot - UWE SCHINDLER Webserver/Middleware Development PANGAEA - Publishing Network for Geoscientific and Environmental Data MARUM - University of Bremen Room 2500, Leobener Str., D-28359 Bremen Tel.: +49 421 218 65595 Fax: +49 421 218 65505 http://www.pangaea.de/ E-mail: uschind...@pangaea.de -Original Message- From: Raffaella Ventaglio [mailto:r.ventag...@gmail.com] Sent: Sunday, February 08, 2009 9:47 AM To: java-user@lucene.apache.org Subject: Re: Faceted search with OpenBitSet/SortedVIntList Hi Paul, One way to implement that would be to use one of the boolean combination filters in contrib, BooleanFilter or ChainedFilter, and simply count the number of times next() returns true on the result. I am sorry, but I cannot understand: how can I create a BooleanFilter or a ChainedFilter starting from two SortedVIntList objects? I have not found any filter that takes an existing DocIdSet in its constructor... However I have seen that the Filter interface is very easy to implement. Should I create a custom Filter that wraps my SortedVIntList and then use these filters to create a BooleanFilter? Thanks, Raf
Re: Faceted search with OpenBitSet/SortedVIntList
On Saturday 07 February 2009 19:57:19 Raffaella Ventaglio wrote: Hi, I am trying to implement a kind of faceted search using Lucene 2.4.0. I have a list of configuration rules that tell me how to generate these facets and the corresponding queries (which can range from simple term queries to complex boolean queries). When my application starts, it creates the whole set of facet objects and initializes them. For each facet: - I create the query according to the configured rule; - I ask the reader for the bitset corresponding to that query and I store it in the Facet object; - I get the cardinality of the bitset and I save it in the Facet object as its initial count. When the user does a search I have to update the counts associated with each Facet: - I get the bitset corresponding to the query + filter generated by the user search; - I get the cardinality of (search bitset AND facet bitset) and I save it as the updated count. In my first solution, I used only OpenBitSetDISI objects, both for the facet bitsets and for the search bitset, so I could use the intersectionCount method to get updated counts after a user search. This works very well and is very fast, but when the number of documents in the index and the number of facets grow it consumes too much memory. So I tried a different solution: when I create the facet bitsets I use the same rule applied in ChainedFilter/BooleanFilter to decide whether to store an OpenBitSet or a SortedVIntList. When I have to calculate updated counts: - if the facet has an OpenBitSet, I use the intersectionCount method directly; - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI using the SortedVIntList.iterator and then I use the intersectionCount method. In this way, I use a smaller amount of memory at initialization time, but for each user search I create a large number of objects (that I immediately throw away) and this affects application performance because it wastes a lot of time doing GC.
So my question is: is there a better way to accomplish this task? I think it would be fine if I could calculate the intersection count directly on SortedVIntList objects, but I have not found anything like that in the Lucene 2.4 JavaDoc. Am I missing something? You are not missing anything. OpenBitSet has an optimized implementation for intersection count, and there is no counterpart of that in SortedVIntList because until now there has been no need for it. One way to implement that would be to use one of the boolean combination filters in contrib, BooleanFilter or ChainedFilter, and simply count the number of times next() returns true on the result. In case the performance of that is not good enough, another way would be to directly add an intersection count method to SortedVIntList. However, SortedVIntList does not allow for an efficient iterator implementation of skipTo(), and skipTo() is used intensively by intersections. As a reference, right now my index contains more than 500,000 documents and I have to create/manage up to 50,000 facets. Using the second solution, at initialization time my facets structure requires more or less 120MB (and this is good enough), while updating counts it uses as much as 2GB of memory (and this is very bad). 50,000 facets? Well, in case the performance of the last suggestion is not good enough, one could try to implement a better data structure than OpenBitSet and SortedVIntList to provide a DocIdSetIterator, preferably with a fast skipTo() and possibly with a fast intersection count. In that case, you may want to ask further on the java-dev list. Regards, Paul Elschot
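One way to avoid the per-search OpenBitSetDISI allocations Raffaella describes is to count a sparse facet directly against the dense search bitset, probing doc id by doc id, so no temporary bitset is built at all. A minimal sketch with java.util.BitSet and a plain sorted int array standing in for the Lucene classes:

```java
import java.util.BitSet;

public class FacetCounter {
    // Count |facet ∩ searchResult| without allocating: each doc id of the
    // sparse facet is probed against the dense search bitset. Cost is
    // O(facet cardinality), independent of the index size.
    static int count(int[] sparseFacetDocs, BitSet searchResult) {
        int c = 0;
        for (int doc : sparseFacetDocs)
            if (searchResult.get(doc)) c++;
        return c;
    }
    public static void main(String[] args) {
        BitSet search = new BitSet();
        search.set(2); search.set(40); search.set(41);
        System.out.println(count(new int[]{2, 7, 41}, search));  // 2
    }
}
```

Dense facets would keep using the bitset-vs-bitset intersection count; only the sparse ones need the probing loop, which is exactly where the SortedVIntList representation is chosen anyway.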
Re: TermScorer default buffer size
John, Continuing, see below. On Wednesday 07 January 2009 14:24:15 Paul Elschot wrote: On Wednesday 07 January 2009 07:25:17 John Wang wrote: Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer size w.r.t. performance in our query (a large OR query): Buffer-size improvement 2042 - 22.0 % 4084 - 39.1 % 8172 - 51.1 % I understand this may not be suitable for every application, so do you think it makes sense to make this buffer size configurable? Ideally the TermScorer buffer size could be set to a size depending on the query structure, but there is no facility for this yet. For OR queries larger buffers help, but not for AND queries. See also LUCENE-430 on reducing buffer sizes for the underlying TermDocs for very sparse doc sets. It may be possible to change the TermScorer buffer size dynamically. For OR queries TermScorer.next() is used, and for AND queries TermScorer.skipTo() is used. That means that when the buffer runs out during TermScorer.next(), it could be enlarged, for example by doubling (or quadrupling) the size to a configurable maximum of 8K or even 16K, see above. When TermScorer.skipTo() runs out of the buffer it could leave the buffer size unchanged. This involves some memory allocation during search. That is unusual, but it could be worthwhile given the performance improvement. Regards, Paul Elschot
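The dynamic-growth idea sketched above — enlarge the buffer when next() exhausts it, leave it alone on skipTo(), cap it at 8K or 16K — might look roughly like this. All names are illustrative; this is not the actual TermScorer code:

```java
// Sketch of a doc-id buffer that starts at TermScorer's default of 32 and
// doubles whenever sequential reading (next()) exhausts it, up to a cap.
// skipTo() would simply not call the growth hook.
public class GrowingBuffer {
    static final int MAX_SIZE = 8 * 1024;  // configurable cap suggested above
    int[] docs = new int[32];

    int size() { return docs.length; }

    // Called when next() drains the buffer: double it, preserving contents.
    void onBufferExhaustedByNext() {
        if (docs.length < MAX_SIZE) {
            int[] bigger = new int[Math.min(docs.length * 2, MAX_SIZE)];
            System.arraycopy(docs, 0, bigger, 0, docs.length);
            docs = bigger;
        }
    }

    public static void main(String[] args) {
        GrowingBuffer b = new GrowingBuffer();
        for (int i = 0; i < 20; i++) b.onBufferExhaustedByNext();
        System.out.println(b.size());  // 8192: growth stops at the cap
    }
}
```

This matches the observation in the thread: OR queries (which drive next()) would organically reach the large sizes that helped John's benchmarks, while AND queries (driven by skipTo()) would stay at the small default.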
Re: TermScorer default buffer size
On Friday 09 January 2009 05:29:15 John Wang wrote: Makes sense. I didn't think 32 was the empirically determined magic number ;) That number does have a history, but I don't know the details. Are you planning to do a patch for this? No, but could you open an issue and mention the performance improvements? Regards, Paul Elschot -John On Thu, Jan 8, 2009 at 1:27 AM, Paul Elschot paul.elsc...@xs4all.nl wrote: John, Continuing, see below. On Wednesday 07 January 2009 14:24:15 Paul Elschot wrote: On Wednesday 07 January 2009 07:25:17 John Wang wrote: Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer size w.r.t. performance in our query (a large OR query): Buffer-size improvement 2042 - 22.0 % 4084 - 39.1 % 8172 - 51.1 % I understand this may not be suitable for every application, so do you think it makes sense to make this buffer size configurable? Ideally the TermScorer buffer size could be set to a size depending on the query structure, but there is no facility for this yet. For OR queries larger buffers help, but not for AND queries. See also LUCENE-430 on reducing buffer sizes for the underlying TermDocs for very sparse doc sets. It may be possible to change the TermScorer buffer size dynamically. For OR queries TermScorer.next() is used, and for AND queries TermScorer.skipTo() is used. That means that when the buffer runs out during TermScorer.next(), it could be enlarged, for example by doubling (or quadrupling) the size to a configurable maximum of 8K or even 16K, see above. When TermScorer.skipTo() runs out of the buffer it could leave the buffer size unchanged. This involves some memory allocation during search. That is unusual, but it could be worthwhile given the performance improvement. Regards, Paul Elschot
Re: TermScorer default buffer size
On Wednesday 07 January 2009 07:25:17 John Wang wrote: Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer size w.r.t. performance in our query (a large OR query): Buffer-size improvement 2042 - 22.0 % 4084 - 39.1 % 8172 - 51.1 % I understand this may not be suitable for every application, so do you think it makes sense to make this buffer size configurable? Ideally the TermScorer buffer size could be set to a size depending on the query structure, but there is no facility for this yet. For OR queries larger buffers help, but not for AND queries. See also LUCENE-430 on reducing buffer sizes for the underlying TermDocs for very sparse doc sets. Regards, Paul Elschot
Re: Lucene retrieval model
On Tuesday 30 December 2008 10:03:03 Claudia Santos wrote: Hello, I would like to know more about Lucene's retrieval model, more specifically about the boolean model. Is that a standard model or an extended model? I mean, does it return just the documents that match the boolean expression, or does it include in the search result all documents which correspond to the given conditions, regardless of the boolean connectors AND, OR, NOT, and calculate a weight between 0 and 1 for all search results that contain at least one of the terms? The extended model evaluates documents with only one of the terms with a smaller value than one that contains both. On the Apache Lucene - Scoring page I found only this: Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification. Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. A somewhat refined Boolean model is used to determine a set of documents, and only for documents in that set is a score value calculated according to the Lucene VSM model. The Boolean model in Lucene does not directly use the standard boolean connectors. Instead of that, each clause (term, subquery) is either required, optional or prohibited. The required and prohibited clauses determine a set of documents to be scored in the normal Boolean AND/NOT way.
The refinement in the Boolean model is for the optional clauses: a minimum number of optional clauses may be required for documents to be part of the set that is scored. The normal Boolean OR operator has 1 as that minimum number, and in Lucene this minimum defaults to 1 when no required clauses are present. The required clauses and the optional clauses contribute to the score. One might consider the scoring of the optional clauses to be an implementation of the extended Boolean model. Fuzzy searching is implemented by constructing a Boolean query with optional (and actually present) terms that are similar enough to the fuzzy query term. Regards, Paul Elschot
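The matching rules Paul describes — required and prohibited clauses select the document set, and a minimum number of optional clauses (cf. BooleanQuery.setMinimumNumberShouldMatch, with the OR default of 1 when no required clauses are present) must also be satisfied — can be modeled in a few lines. A toy evaluation of the Boolean model, not Lucene code:

```java
import java.util.*;

public class BooleanModel {
    // Does a document (represented by its term set) enter the scored set?
    static boolean matches(Set<String> docTerms,
                           Set<String> required, Set<String> optional,
                           Set<String> prohibited, int minShouldMatch) {
        if (!docTerms.containsAll(required)) return false;          // Boolean AND
        for (String t : prohibited)
            if (docTerms.contains(t)) return false;                 // Boolean NOT
        int optionalHits = 0;
        for (String t : optional)
            if (docTerms.contains(t)) optionalHits++;
        int min = minShouldMatch;
        // the OR default: with no required clauses, at least one
        // optional clause must match even when no minimum was set
        if (min == 0 && required.isEmpty() && !optional.isEmpty()) min = 1;
        return optionalHits >= min;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("lucene", "boolean"));
        Set<String> opt = new HashSet<>(Arrays.asList("lucene", "fuzzy"));
        // pure OR: one optional term present suffices
        System.out.println(matches(doc, Collections.emptySet(), opt,
                                   Collections.emptySet(), 0));  // true
        // demanding both optional terms: no longer matches
        System.out.println(matches(doc, Collections.emptySet(), opt,
                                   Collections.emptySet(), 2));  // false
    }
}
```

Only documents for which this returns true would then get a VSM score from the required and optional clauses that matched.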
Re: BooleanQuery Performance Help
On Saturday 20 December 2008 15:23:43 Prafulla Kiran wrote: Hi Everyone, I have an index of relatively small size (400MB), containing roughly 0.7 million documents. The index is actually a copy of an existing database table. Hence, most of my queries are of the form +field1:value1 +field2:value2 +field3:value3 ... (~20 fields). I have been running performance tests using this query. Strangely, I noticed that if I remove some specific clauses I get a performance improvement of at least 5 times. Here are the numbers and examples, so that I can be more precise: 1) Complete query: 90 requests per second using 10 threads. 2) If I remove a few specific clauses: 500 requests per second using 10 threads. 3) If I form a new query using only 2 clauses from the set of removed clauses: 100 requests per second using 10 threads. Now, some of these specific clauses are such that they match around half of the entire document set. Also, note that I need all the query terms to be present in the documents retrieved. My target is to obtain 300 requests per second with the given query (20 clauses). It includes 2 range queries. However, I am unable to get 300 rps unless I remove some of the clauses (which include these range queries). I have tried using filters without any significant improvement in performance. Also, I have more than enough RAM, so I am using a RAMDirectory to read the index. I have optimized my index before searching. All the tests have been warmed for 5 seconds (the test duration is 10 seconds). My first question is: is this kind of decrease in performance expected as the number of clauses shoots up? Using a single clause out of these 20, I was able to get 2000 requests per second! Could someone please guide me if there are any other ways in which I can obtain an improvement in performance?
You might try and add brackets and a + around a group of the less frequently occurring terms, like this: +field1:frequentValue1 +field2:frequentValue2 +(+field3:inFrequentValue3 +field4:inFrequentValue4) This may help, and at least it should not degrade performance much. Also, it will affect score values somewhat. Particularly, I am interested to know more about what further caching could be done apart from the default caching which Lucene does. More caching is probably not going to help. Regards, Paul Elschot
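[Editorial note] The suggested regrouping can be expressed programmatically with a nested BooleanQuery; a sketch assuming the Lucene 2.x API, with the field and value names taken from the example query string above:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class GroupedClausesExample {
    public static BooleanQuery build() {
        // Inner group: the less frequently occurring terms, all required.
        BooleanQuery rareGroup = new BooleanQuery();
        rareGroup.add(new TermQuery(new Term("field3", "inFrequentValue3")),
                      BooleanClause.Occur.MUST);
        rareGroup.add(new TermQuery(new Term("field4", "inFrequentValue4")),
                      BooleanClause.Occur.MUST);

        // Outer query: the frequent terms, plus the inner group as one required clause.
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term("field1", "frequentValue1")), BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term("field2", "frequentValue2")), BooleanClause.Occur.MUST);
        bq.add(rareGroup, BooleanClause.Occur.MUST);
        return bq;
    }
}
```

The inner group lets the conjunction over the rare terms skip ahead together, so the frequent terms are consulted for fewer candidate docs.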
Re: RESOLVED: help: java.lang.ArrayIndexOutOfBoundsException ScorerDocQueue.downHeap
On Wednesday 17 December 2008 22:49:08, 1world1love wrote: Just an FYI in case anyone runs into something similar. Essentially I had indexes that I have been searching from a Java stored procedure in Oracle without issue for a while. All of a sudden, I started getting the error I alluded to above when there were more than a certain number of terms (4, 5, or more depending on the terms or index). The error did not happen when I ran a query from a local server with the same filesystem mounted. In that case the root cause of the error could be in the JVM running the stored procedure. In any case, all of my indexes checked out OK. I read through all the other issues related to my issue but none of the fixes did anything. However, setting BooleanQuery.setAllowDocsOutOfOrder(true); did in fact make the error go away. Although I understand the idea behind the setting, I am not sure why it made a difference in my case. That option chooses another algorithm to search these queries; it will only affect queries without required terms. (The change in search algorithm is from BooleanScorer2 to BooleanScorer.) Regards, Paul Elschot
Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)
Michael, The change from BitSet to DocIdSetIterator implies that you'll need to choose an underlying data structure yourself. A minimal approach would be to use DocIdBitSet around BitSet, but there are better ways. For your application you might consider replacing Java's BitSet by Lucene's OpenBitSet. Also have a look at earlier discussions on the subject: you might find a good use for OpenBitSetDISI and contrib/**/{BooleanFilter,ChainedFilter}. Regards, Paul Elschot On Tuesday 09 December 2008 07:44:20, Michael Stoppelman wrote: Hi all, I'm working on upgrading to Lucene 2.4.0 from 2.3.2 and was trying to integrate the new DocIdSet changes, since the o.a.l.search.Filter#bits() method is now deprecated. For our app we actually heavily rely on bits from the Filter to do post-query filtering (I explain why below). For example, someone searches for product: ipod and then filters on type: nano (e.g. mini/nano/regular) AND color: red (e.g. red/yellow/blue). In our current model the results are gathered in the following way: 1) ipod w/o attributes is run and the results are stored in a hit collector. 2) The ipod results are then filtered for color=red AND type=mini using the Lucene Filters. 3) The filtered results are returned to the user. The reason that the attributes are filtered post-query is so that we can return the other types and colors the user can filter by in the future. Meaning the UI would be able to show blue, green, pink, etc.; if we pre-filtered results by color and type beforehand we wouldn't know what the other filter options would be for a broader result set. Does anyone else have this use case? I'd imagine other folks are probably doing similar things to accomplish this. M
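[Editorial note] A minimal sketch of the migration under discussion, assuming the Lucene 2.4 API: a Filter that overrides getDocIdSet() and returns an OpenBitSet (which is itself a DocIdSet), here filled from a TermDocs enumeration. The class name is invented for illustration.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

public class SingleTermFilter extends Filter {
    private final Term term;

    public SingleTermFilter(Term term) {
        this.term = term;
    }

    // Replaces the deprecated bits(): return a DocIdSet instead of a java.util.BitSet.
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(term);
        try {
            while (td.next()) {
                bits.set(td.doc()); // mark every doc containing the term
            }
        } finally {
            td.close();
        }
        return bits;
    }
}
```

Two such OpenBitSet-backed sets can then be combined with and()/or() for the kind of post-query attribute filtering described above.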
Re: 2.4 Performance
On Wednesday 19 November 2008 03:39:01, [EMAIL PROTECTED] wrote: ... Our design is roughly as follows: we have some pre-query filters, queries typically involving around 25 clauses, and some post-processing of hits. We collect counts and filter post-query using a hit collector, which uses the (now deprecated) bits() method of Filters. I looked at converting us to use the new DocIdSet infrastructure (to gain the supposed 30% speed bump), but this seems to be somewhat problematic as there is no guarantee for whether we will get back a set we can do binary operations on. For example, if we get back a SortedVIntList, we're pretty much out of luck: the cardinality of the set is large (as it's a SortedVIntList), so we can't coerce it into another type, and it doesn't have the set operations we need to use it directly. Is this part of the problem https://issues.apache.org/jira/browse/LUCENE-1296 ? Also consider o.a.l.util.OpenBitSetDISI, and how that is used in contrib/queries/**/BooleanFilter. Regards, Paul Elschot
Re: Term numbering and range filtering
Tim, On Wednesday 19 November 2008 02:32:40, Tim Sturge wrote: ... This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. In case you think more performance improvements based on this are possible... I think this is generally useful for range and set queries on non-text based fields (dates, location data, prices, general enumerations). These all have the required property that there is only one value (term) per document. I've opened LUCENE-1461. I finally got the point, see my comments there. Thanks a lot, if only to show an unexpected tradeoff possibility opened by the new Filter API. I don't know whether you followed LUCENE-584 (Decouple Filter from BitSet), but a contribution like this multi range filter makes it all worthwhile. Regards, Paul Elschot
Re: Term numbering and range filtering
On Wednesday 19 November 2008 00:43:56, Tim Sturge wrote: I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index.
Firstly, without an age constraint as a baseline:
Query +name:tim (startup: 0, Hits: 15089, first query: 1004, 100 queries: 132, i.e. 1.32 msec per query)
Now with a cached filter. This is ideal from a speed standpoint, but there are too many possible start/end combinations to cache all the filters:
Query +name:tim age:[18 TO 35] (ConstantScoreQuery on cached RangeFilter; startup: 3, Hits: 11156, first query: 1830, 100 queries: 287, i.e. 2.87 msec per query)
Now with an uncached filter. This is awful:
Query +name:tim age:[18 TO 35] (uncached ConstantScoreRangeQuery; startup: 3, Hits: 11156, first query: 1665, 100 queries: 51862; yes, 518 msec per query, 200x slower)
A RangeQuery is slightly better but still bad (and has a different result set):
Query +name:tim age:[18 TO 35] (uncached RangeQuery; startup: 0, Hits: 10147, first query: 1517, 100 queries: 27157; 271 msec is 100x slower than the filter)
Now with the prebuilt column stride filter:
Query +name:tim age:[18 TO 35] (ConstantScoreQuery on prebuilt column stride filter)
With Allow Filter as clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 one could even skip the ConstantScoreQuery with this. Unfortunately, 1345 is unfinished for now.
(startup: 2811, Hits: 11156, first query: 1395, 100 queries: 441; back down to 4.41 msec per query)
This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. In case you think more performance improvements based on this are possible...
Regards, Paul Elschot.
Re: Term numbering and range filtering
On Tuesday 11 November 2008 11:29:27, Michael McCandless wrote: The other part of your proposal was to somehow number term text such that term range comparisons can be implemented as fast int comparisons. ... http://fontoura.org/papers/paramsearch.pdf However, that'd be quite a bit deeper change to Lucene. The cheap version is hierarchical prefixing, here: http://wiki.apache.org/jakarta-lucene/DateRangeQueries Regards, Paul Elschot
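[Editorial note] The hierarchical prefixing trick indexes each date at several granularities (year, month, day), so a range can be covered by a few coarse terms plus a few fine ones. A small self-contained sketch of the term expansion; the yyyyMMdd layout is an illustration, not necessarily the wiki page's exact scheme:

```java
import java.util.Arrays;
import java.util.List;

public class DatePrefixExample {
    // Expand a yyyyMMdd date into the terms indexed for it,
    // one term per granularity level.
    public static List<String> expand(String yyyymmdd) {
        return Arrays.asList(
            yyyymmdd.substring(0, 4),  // year,  e.g. "2008"
            yyyymmdd.substring(0, 6),  // month, e.g. "200811"
            yyyymmdd);                 // day,   e.g. "20081119"
    }
}
```

A query for [20080101 TO 20081231] can then match the single year term "2008" instead of enumerating hundreds of day terms.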
Re: Term numbering and range filtering
On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote: Also, one nice optimization we could do with the term number column-stride array is do bit packing (borrowing from the PFOR code) dynamically. I.e., since we know there are X unique terms in this segment, when populating the array that maps docID to term number we could use exactly the right number of bits. Enumerated fields with not many unique values (eg, country, state) would take relatively little RAM. With LUCENE-1231, where the fields are stored column stride on disk, we could do this packing during indexing such that loading at search time is very fast. Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410. I think what you're referring to is PDICT, which has frame exceptions for values that occur infrequently. Regards, Paul Elschot
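[Editorial note] The packing idea above: with X unique terms in a segment, each docID-to-termNumber entry needs only ceil(log2(X)) bits. A self-contained sketch of the arithmetic (not Lucene code):

```java
public class TermNumberPacking {
    // Bits needed to store any term number in [0, numUniqueTerms).
    public static int bitsPerValue(int numUniqueTerms) {
        if (numUniqueTerms <= 1) {
            return 1; // at least one bit, even for a constant field
        }
        // Position of the highest set bit of the largest value.
        return 32 - Integer.numberOfLeadingZeros(numUniqueTerms - 1);
    }

    // Packed size in bytes for a whole segment, rounded up.
    public static long packedBytes(int maxDoc, int numUniqueTerms) {
        long bits = (long) maxDoc * bitsPerValue(numUniqueTerms);
        return (bits + 7) / 8;
    }
}
```

For a country field with around 250 distinct values, a 45M-doc segment needs 8 bits per doc, roughly 45 MB, instead of the 180 MB an int[] would take.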
Re: Term numbering and range filtering
Tim, I didn't follow all the details, so this may be somewhat off, but did you consider using TermVectors? Regards, Paul Elschot On Monday 10 November 2008 19:18:38, Tim Sturge wrote: Yes, that is a significant issue. What I'm coming to realize is that either I will end up with something like class MultiFilter { String field; private int[] termInDoc; Map<Term,Integer> termToInt; ... } which can be entirely built on the current Lucene APIs but has significantly more overhead (the termToInt mapping in particular, and the need to construct the mapping and array on startup). Or I can go deep into the guts and add a data file per segment with a format something like: int version; int numFields; (int fieldNum, long offset) ^ numFields; (int termForDoc) ^ (maxDocs * numFields) and add something to FieldInfo like boolean storeMultiFilter; and to FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to add an int termNum to the .tis file as well. This is clearly a lot more work than the first solution, but it is a lot nicer to deal with as well. Is this interesting to anyone other than me? Tim On 11/9/08 12:23 PM, Michael McCandless [EMAIL PROTECTED] wrote: Conceivably, TermInfosReader could track the sequence number of each term. A seek/skipTo would know which sequence number it just jumped to, because the index is regular (every 128 terms by default), and then each next() call could increment that. Then retrieving this number would be as costly as calling e.g. IndexReader.docFreq(Term) is now. But I'm not sure how a multi-segment index would work, ie how would MultiSegmentReader compute this for its terms? Or maybe you'd just do this per-segment?
Mike Tim Sturge wrote: Hi, I'm wondering if there is any easy technique to number the terms in an index. (By number I mean map a sequence of terms to a contiguous range of integers, and map terms to these numbers efficiently.) Looking at the Term class and the .tis/.tii index format, it appears that the terms are stored in an ordered and prefix-compressed format, but while there are pointers from a term to the .frq and .prx files, neither is really suitable as a sequence number. The reason I have this question is that I am writing a multi-filter for single term fields. My index contains many fields for which each document contains a single term (e.g. date, zipcode, country) and I need to perform range queries or set matches over these fields, many of which are very inclusive (they match 10% of the total documents). A cached RangeFilter works well when there are a small number of potential options (e.g. for countries), but when there are many options (consider a date range or a set of zipcodes) there are too many potential choices to cache each possibility, and it is too inefficient to build a filter on the fly for each query (as you have to visit 10% of documents to build the filter despite the query itself matching 0.1%). Therefore I was considering building an int[reader.maxDocs()] array for each field and putting into it the term number for each document. This relies on the fact that each document contains only a single term for this field, but with it I should be able to quickly construct a "multi-filter" (that is, something that iterates the array and checks that the term is in the range or set). Right now it looks like I can do some very ugly surgery and perhaps use the offset to the .prx file even though it is not contiguous. But I'm hoping there is a better technique that I'm just not seeing right now.
Thanks, Tim
Re: Term numbering and range filtering
On Monday 10 November 2008 22:21:20, Tim Sturge wrote: Hmmm -- I hadn't thought about that, so I took a quick look at the term vector support. What I'm really looking for is a compact but performant representation of a set of filters on the same (one term) field. Using term vectors would mean an algorithm similar to: String myfield; String myterm; TermFreqVector tv; for (int i = 0; i < maxDoc; i++) { tv = reader.getTermFreqVector(i, "country"); if (tv.indexOf(myterm) != -1) { // include this doc... } } The key thing I am looking to achieve here is performance comparable to filters. I suspect getTermFreqVector() is not efficient enough, but I'll give it a try. Better use a TermDocs on myterm for this; have a look at the code of RangeFilter. Filters are normally created from a slower query by setting a bit in an OpenBitSet at include this doc. Then they are reused for their speed. Filter caching could help. In case memory becomes a problem and the filters are sparse enough, try and use SortedVIntList as the underlying data structure in the cache. (Sparse enough means less than 1 in 8 of all docs available in the index reader.) See also LUCENE-1296 for caching another data structure than the one used to collect the filtered docs. Regards, Paul Elschot
Re: How to combine filter in Lucene 2.4?
On Sunday 09 November 2008 11:56:37, markharw00d wrote: this can't be nearly as fast as OpenBitSet.intersect() or union, respectively, can it? I had a similar concern but it doesn't seem that bad: https://issues.apache.org/jira/browse/LUCENE-1187?focusedCommentId=12596546#action_12596546 The above test showed a slight improvement using bitset.or when it was recognised both docidsets were OpenBitSets. This optimisation is now in BooleanFilter. Further to that, the current implementation of OpenBitSetDISI.inPlaceAnd() is not optimal, although it should work just fine. A patch for a performance improvement will follow. Regards, Paul Elschot Cheers Mark
Re: How to combine filter in Lucene 2.4?
Timo, You may be looking for class OpenBitSetDISI in the util package; it was made for boolean filter operations on OpenBitSets. Also have a look at the contrib modules; OpenBitSetDISI is used there in two classes that do (precisely?) what you need: contrib/miscellaneous/**/ChainedFilter contrib/queries/**/BooleanFilter Regards, Paul Elschot On Saturday 08 November 2008 19:06:15, Timo Nentwig wrote: Hi! Since Filter.bits() is deprecated and replaced by getDocIdSet() now, I wonder how I am supposed to combine (AND) filters (for facets). I worked around this issue by extending Filter and letting getDocIdSet() return an OpenBitSet to ensure that this implementation is used everywhere and casting to OpenBitSet will work, but this is really not clean code. Thanks Timo
Re: Sorting posting lists before intersection
On Monday 13 October 2008 17:00:06, Andrzej Bialecki wrote: Renaud Delbru wrote: Hi Andrzej, sorry for the late reply. I have looked at the code. As far as I understand, you sort the posting lists based on the first doc skip. The first posting list will be the one that has the first biggest document skip. Is the sparseness of posting lists a good predictor for sampling and ordering posting lists? Do you know of evaluations of such a technique? It is _some_ predictor ... :) whether it's a good one is another question. It's certainly very inexpensive: we don't do any additional IO except what we have to do anyway, which is scorer.skipTo(). In the general case it's costly to calculate the frequency (or sparseness) of matches in a scorer without actually running the scorer through all its matches. In order to implement sorting based on frequency, we need the document frequency of each term. This information should be propagated through the Scorer classes (from TermScorer to a higher level class such as ConjunctiveScorer). This will require a call to IndexReader.docFreq(term) for each of the term queries. Does a docFreq call mean another IO access? It sounds like you plan to order scorers by term frequency ... but in the general case they won't all be TermScorers, so the frequency of documents matching a scorer won't have any particular connection to a single term freq. This could be done, but since not all scorers will be TermScorers it will be necessary to add a method to Scorer (or perhaps even to its DocIdSetIterator superclass): public abstract int estimatedDocFreq(); and implement this for all existing instances. TermScorer could implement it without estimating. For AND/OR/NOT such an estimation is straightforward, but for proximity queries it would be more of a guess. Regards, Paul Elschot
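[Editorial note] A sketch of how the estimate proposed above could propagate through a scorer tree. These classes and the estimatedDocFreq() method are hypothetical (they follow the proposal, not any existing Lucene API); for an AND, the rarest clause bounds the result:

```java
// Hypothetical scorer hierarchy carrying a doc-frequency estimate.
abstract class EstimatingScorer {
    public abstract int estimatedDocFreq();
}

// Leaf: a TermScorer-like node knows its term's docFreq exactly.
class FixedDocFreq extends EstimatingScorer {
    private final int docFreq;

    FixedDocFreq(int docFreq) {
        this.docFreq = docFreq;
    }

    public int estimatedDocFreq() {
        return docFreq;
    }
}

// Conjunction: an AND matches at most as many docs as its rarest clause.
class EstimatingConjunction extends EstimatingScorer {
    private final EstimatingScorer[] clauses;

    EstimatingConjunction(EstimatingScorer[] clauses) {
        this.clauses = clauses;
    }

    public int estimatedDocFreq() {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < clauses.length; i++) {
            min = Math.min(min, clauses[i].estimatedDocFreq());
        }
        return min;
    }
}
```

Such estimates could then drive the posting-list ordering discussed in the thread, cheapest (rarest) clause first.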
Re: PhraseQuery issues - differences with SpanNearQuery
On Friday 05 September 2008 16:57:34, Mark Miller wrote: Paul Elschot wrote: On Thursday 04 September 2008 20:39:13, Mark Miller wrote: Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors the edit distance into scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me if I am wrong). SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. The span size is the difference in position between the first and last matching term, and idf is not used for scoring Spans. The reason why idf is not used could be that there is no basic score value associated with inner spans; only top level spans are scored by SpanScorer. For more details, please consult the SpanScorer code. Regards, Paul Elschot Right, my fault, it's the query normalization in the weight which uses idf (by pulling from each clause in the span). So it's kind of factored into the score, but not in the way I implied. Sorry, my bad on the info. Well, I had missed the phrase idf over all the SpanQuery terms as used from the SpanWeight. Regards, Paul Elschot
Re: PhraseQuery issues - differences with SpanNearQuery
On Thursday 04 September 2008 20:39:13, Mark Miller wrote: Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors the edit distance into scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me if I am wrong). SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. The span size is the difference in position between the first and last matching term, and idf is not used for scoring Spans. The reason why idf is not used could be that there is no basic score value associated with inner spans; only top level spans are scored by SpanScorer. For more details, please consult the SpanScorer code. Regards, Paul Elschot - Mark Yannis Pavlidis wrote: Hi, I am having an issue when using the PhraseQuery which is best illustrated with this example: I have created 2 documents to emulate URLs. One with a URL of http://www.airballoon.com and title "air balloon", and the second one with URL http://www.balloonair.com and title "balloon air".
Test1 (PhraseQuery) == Now when I use the phrase query title:"air balloon"~2 I get back:
url: http://www.airballoon.com - score: 1.0
url: http://www.balloonair.com - score: 0.57
Test2 (PhraseQuery) == Now when I use the phrase query title:"balloon air"~2 I get back:
url: http://www.balloonair.com - score: 1.0
url: http://www.airballoon.com - score: 0.57
Test3 (PhraseQuery) == Now when I use the phrase query title:"air balloon"~2 title:"balloon air"~2 I get back:
url: http://www.airballoon.com - score: 1.0
url: http://www.balloonair.com - score: 1.0
Test4 (SpanNearQuery) === spanNear([title:air, title:balloon], 2, false) I get back:
url: http://www.airballoon.com - score: 1.0
url: http://www.balloonair.com - score: 1.0
I would have expected that Test1 and Test2 would actually return both URLs with a score of 1.0, since I am setting the slop to 2. It seems though that Lucene really favors an absolute exact match. Is it safe to assume that for what I am looking for (basically score the docs the same regardless of whether someone is searching for "air balloon" or "balloon air") it would be better to use the SpanNearQuery rather than the PhraseQuery? Any input would be appreciated. Thanks in advance, Yannis.
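[Editorial note] The Test4 query above can be built as follows, assuming the Lucene 2.x spans package; inOrder=false is what makes "air balloon" and "balloon air" match alike:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class UnorderedNearExample {
    public static SpanNearQuery build() {
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("title", "air")),
            new SpanTermQuery(new Term("title", "balloon"))
        };
        // slop 2, unordered: both word orders match on the same basis,
        // unlike PhraseQuery, which penalizes the reversed order via edit distance.
        return new SpanNearQuery(clauses, 2, false);
    }
}
```
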
Re: Pre-filtering for expensive query
On Wednesday 03 September 2008 18:06:57, Matt Ronge wrote: On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote: On Saturday 30 August 2008 18:19:09, Matt Ronge wrote: On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote: Can you tell us a bit more about what your custom query does? Perhaps you can build the candidate filter and reuse it over and over again? I cannot reuse it. The candidate filter would be constructed by first running a boolean query with a number of SHOULD clauses. So then I know what docs at least contain the terms I'm looking for. Once I have this set, I will look at the ordering of the matches (it's a bit more sophisticated than just a phrase query) and find the final matches. Sounds like you may want to take a look at SpanNearQuery. I'm going to take a second look at SpanNearQuery. I need it to support optional tokens, so I'm guessing I'll need to create a subclass to do that. SpanNearQuery was not designed for optional tokens. This can be tricky, so make sure your specs are good. I know only of this article for optional tokens and proximity: Kunihiko Sadakane and Hiroshi Imai. Fast algorithms for k-word proximity search. IEICE Trans. Fundamentals, E84-A(9), September 2001. Since my boolean clauses are different for each query I can't reuse the filter. With (a variation of) SpanNearQuery you may end up not needing any filtering at all, because it already uses skipTo() where possible. In case you are looking for documents that contain partial phrases from an input query that has more than 2 words, have a look at Nutch. I poked around in the Nutch docs and Javadocs; what should I look at in Nutch? What does it do exactly? Is it the trick that Doug Cutting mentioned where you concat neighboring terms together, like "Hello world" becomes the token "hello.world"? That is an optimization for combinations of high frequency terms, which is built into Nutch iirc. But I don't know the details, so please ask on a Nutch list.
Regards, Paul Elschot
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 18:22:50, Matt Ronge wrote: On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote: On Saturday 30 August 2008 03:34:01, Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. Correct. I suppose you mean the filtering code in IndexSearcher? Yes, that's exactly what I mean. As Grant pointed out, this code was recently changed by LUCENE-584. I was referring to the (current trunk) code including this change that uses skipTo() on a DocIdSetIterator obtained from the Filter. Sorry for any confusion on this. Regards, Paul Elschot
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 03:34:01, Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. Correct. I suppose you mean the filtering code in IndexSearcher? So my initial boolean query won't help in limiting the number of documents scored by my expensive query. The trick of filtering is the use of skipTo() on both the filter and the scorer to skip superfluous work as much as possible. So when you make your scorer implement skipTo() efficiently, filtering it should reduce the amount of scoring done. Implementing skipTo() efficiently is normally done by using TermScorer.skipTo() on the leafs of a scorer structure. So, in case you implement your own TermScorer, take a serious look at TermScorer.skipTo(). Normally, score value computations are not the bottleneck, but accessing the index is, and this is where skipTo() does the real work. At the moment avoiding score value computations is a nice extra. Has anyone done any work on restricting the set of docs that a query operates on? Yes, Filters. Or should I just implement something myself in a custom scorer? In case you have a better way than skipTo(), or something to improve on this issue to allow a Filter as a clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 let us know. Regards, Paul Elschot
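[Editorial note] The skipTo() leapfrog between a filter and a scorer can be modeled on two sorted doc-id lists: each side jumps directly to the other side's current candidate instead of stepping one doc at a time. A self-contained sketch, with plain arrays standing in for the DocIdSetIterators:

```java
import java.util.ArrayList;
import java.util.List;

public class LeapfrogExample {
    // Intersect two sorted doc-id arrays the way filtered search does:
    // alternately skip each iterator forward to the other's current doc.
    public static List<Integer> intersect(int[] filterDocs, int[] scorerDocs) {
        List<Integer> hits = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < filterDocs.length && j < scorerDocs.length) {
            if (filterDocs[i] == scorerDocs[j]) {
                hits.add(Integer.valueOf(filterDocs[i])); // doc passes both sides
                i++;
                j++;
            } else if (filterDocs[i] < scorerDocs[j]) {
                // models filter.skipTo(scorerDocs[j])
                while (i < filterDocs.length && filterDocs[i] < scorerDocs[j]) i++;
            } else {
                // models scorer.skipTo(filterDocs[i])
                while (j < scorerDocs.length && scorerDocs[j] < filterDocs[i]) j++;
            }
        }
        return hits;
    }
}
```

In Lucene the inner loops are the iterators' skipTo() implementations, which use the index skip data rather than linear stepping, and score values are only computed for docs that survive the intersection.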
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 18:19:09, Matt Ronge wrote: On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote: Can you tell us a bit more about what your custom query does? Perhaps you can build the candidate filter and reuse it over and over again? I cannot reuse it. The candidate filter would be constructed by first running a boolean query with a number of SHOULD clauses. So then I know what docs at least contain the terms I'm looking for. Once I have this set, I will look at the ordering of the matches (it's a bit more sophisticated than just a phrase query) and find the final matches. Sounds like you may want to take a look at SpanNearQuery. Since my boolean clauses are different for each query I can't reuse the filter. With (a variation of) SpanNearQuery you may end up not needing any filtering at all, because it already uses skipTo() where possible. In case you are looking for documents that contain partial phrases from an input query that has more than 2 words, have a look at Nutch. Regards, Paul Elschot -- Matt Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. So my initial boolean query won't help in limiting the number of documents scored by my expensive query. Has anyone done any work on restricting the set of docs that a query operates on? Or should I just implement something myself in a custom scorer? Thanks in advance, -- Matt Ronge
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 18:22:50, Matt Ronge wrote: On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote: On Saturday 30 August 2008 03:34:01, Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. Correct. I suppose you mean the filtering code in IndexSearcher? Yes, that's exactly what I mean. So my initial boolean query won't help in limiting the number of documents scored by my expensive query. The trick of filtering is the use of skipTo() on both the filter and the scorer to skip superfluous work as much as possible. So when you make your scorer implement skipTo() efficiently, filtering it should reduce the amount of scoring done. Implementing skipTo() efficiently is normally done by using TermScorer.skipTo() on the leafs of a scorer structure. So, in case you implement your own TermScorer, take a serious look at TermScorer.skipTo(). Normally, score value computations are not the bottleneck, but accessing the index is, and this is where skipTo() does the real work. At the moment avoiding score value computations is a nice extra. I was not aware of this. Where can I find the code that uses the filter to determine what values to feed to skipTo (I'm trying to get a better understanding of the Lucene source)? It's the same code in IndexSearcher. ConjunctionScorer.skipTo() does much the same thing for any number of scorers. Or should I just implement something myself in a custom scorer?
In case you have a better way than skipTo(), or something to improve on this issue to allow a Filter as a clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 let us know. Thanks, if the skipTo approach doesn't work, I'll take a look at this. For the moment, Andrzej's suggestion to use FilteredQuery as a clause could well be good enough. Btw. FilteredQuery also contains a filtering scorer under the hood, you could take a look there, too. Regards, Paul Elschot
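For readers following along, the combination discussed in this thread can be sketched roughly as below against the Lucene 2.x API. This is only a sketch of the wiring, not code from the thread; `candidateQuery`, `expensiveQuery` and `searcher` are placeholders.

```java
// Sketch (Lucene 2.x era API, untested): wrap the cheap boolean query as a
// filter, so the expensive scorer only has to skipTo() candidate documents.
Filter candidates = new QueryWrapperFilter(candidateQuery);

// Option 1: pass the filter directly to the searcher.
Hits hits = searcher.search(expensiveQuery, candidates);

// Option 2 (Andrzej's suggestion): use FilteredQuery, e.g. as a clause
// inside a larger BooleanQuery.
Query filtered = new FilteredQuery(expensiveQuery, candidates);
```

Either way, the benefit depends on the expensive scorer implementing skipTo() efficiently, as discussed above.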
Re: Fastest way to get just the bits of matching documents
On Thursday 24 July 2008 23:00:33 Robert Stewart wrote: Queries are very complex in our case, some have up to 100 or more clauses (over several fields), including disjunctions and prohibited clauses. Other than the earlier advice, did you try setAllowDocsOutOfOrder()? Regards, Paul Elschot
Re: Scoring filters
On Wednesday 11 June 2008 01:41:38 Karl Wettin wrote: Each of my filters represents a single boosting term query. But when using the filter instead of the boosting term query I lose the score (not sure this is true) and payload boost (if any), both essential for the quality of my results. If I was to add payloads to the bits that are set, what is the best or simplest way to get the score back in? How about wrapping each filter in a query? Are there any obvious problems with this strategy that I've missed? Why not add the boosting term queries as required to a BooleanQuery? This has the advantage that it uses the index data and the various caches built into Lucene and the underlying OS. In case you have the memory available, it is also possible to keep the score values of any Query with the Filter and implement a Scorer using the filter docs and these score values. Then use this as the scorer for a new Query, via a Weight. Once this new Query is available, just add it as required to a BooleanQuery. Regards, Paul Elschot
Re: SpanNearQuery: how to get the intra-span matching positions?
On Friday 30 May 2008 12:10 Claudio Corsi wrote: Hi all, I'm querying my index with a SpanNearQuery built on top of some SpanOrQuery. Now, the Spans object I get from the SpanNearQuery instance returns the sequence of text spans, each defined by their starting/ending positions. I'm wondering if there is a simple way to get not only the start/end positions of the entire span, but the single matching positions inside such a span. For example, suppose that a SpanNearQuery composed of 3 SpanTermQuery (with a slop of K) produces as resulting span the terms sequence: t0 t1 t2 t3 ... t100 (so start() == 0, end() == 100). I know that for sure t0 and t100 have generated a match, since the span is minimal (right?). Right. But make sure to test, some less than straightforward situations are possible when matching spans. For example, the subqueries may be SpanNearQuery's themselves instead of SpanTermQuery's. But I also know that there is a third match somewhere in the span (I have 3 SpanTermQuery that have to match). Is there a way to discover it? To get this information, you'll have to extend NearSpansOrdered and NearSpansUnordered (package private classes in o.a.l.search.spans) to also provide for example an int[] with the actual matching 'positions', or subspans each with their own begin and end. This is fairly straightforward, but to actually use such positions SpanScorer will also need to be extended or even replaced. In case you want to continue this discussion, please do so on java-dev. Regards, Paul Elschot.
Re: SpanNearQuery scoring
On Friday 23 May 2008 15:19:03 Karl Wettin wrote: Everything (scores, explanations and not hitting breakpoints while debugging) seems to indicate that SpanNearQuery doesn't use the scoring of the inner spans. Is this true? Yes. If so, is it intentional? I don't know. The Spans interface does not contain a weight() or score() method, so there is no way to pass such information to SpanScorer. Regards, Paul Elschot
Re: multi word synonyms
On Sunday 18 May 2008 16:30:26 Karl Wettin wrote: On 18 May 2008 at 00.01, Paul Elschot wrote: On Saturday 17 May 2008 20:28:40 Karl Wettin wrote: As far as I know Lucene only handles single word synonyms at index time. My life would be much simpler if it was possible to add synonyms that spanned over multiple tokens, such as lucene in action=lia. I have a couple of workarounds that are OK but it really isn't the same thing when it comes down to the scoring. The simplest solution is to index such synonyms at the first or last or middle position of the source tokens, using a zero position increment for the synonym. Was this one of the workarounds? I get sloppyFreq problems with that. The advantage of the zero position increment is that the original token positions are not affected, so at least there is no influence on scoring because of changes in the original token positions. I copy a number of fields to a single one. Each such field can be represented in a number of languages or aliases in the same language. [a, b, c, d, e, f], [g, h, i], [j, k, l, m] [o, p] [u, v] [q, r, s, t] It would be great if the phrase query on [f, o, p, u, v] could yield a 0 distance. If I'd been using the same synonyms for the same phrases in all documents at all times the edit distance would be static when scoring, but I don't. The terms of these synonyms are not really compatible with each other. For instance [f, g, s, t, j] should not be allowed or at least be heavily penalised compared to [f, o, p, j]. Searching a combination of languages should be allowed but preferably only one per field copied to the big field. (Disjunction is not applicable.) It is OK the way I have it running now, but more dimensions as described above really increase the score quality. I confirmed that using permutations of documents and filtering out the duplicates. Now I'm thinking it could be solved using token payloads and a brand new MultiDimensionalSpanQuery.
Not too different from what you suggested way back in http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016 That would mean a term extending tag to indicate that a term is on an alternative path? There are some other issues too, but I'm not at liberty to disclose too much. I hope it still makes sense? Yes. I suppose the payload would indicate how much the alternative path length differs from the original path? In case you can't disclose more, no answer would of course be OK, too. Regards, Paul Elschot
Re: MultiTerm Or Query with per-term boost. Does it exist?
See below. On Sunday 18 May 2008 21:03:19 John Jensen wrote: The only problem is that I'm thinking that a special purpose Query subclass might be faster, but I was wondering if others have run into similar situations, and whether they saw a performance win by replacing complex BooleanQueries with a special purpose Query subclass. Unfortunately the boosts are query specific and can't be done at index time. Thanks, John On Sun, May 18, 2008 at 9:30 AM, Karl Wettin [EMAIL PROTECTED] wrote: On 18 May 2008 at 02.25, John Jensen wrote: Hi, I have an application where I need to issue queries with a large number of or-terms with individual boosts. Currently I just construct a BooleanQuery with a large number (often 1000) of constituent TermQueries. I'm wondering if there is a better way to do this? I'm open to implementing my own Query subclass if I can expect significant performance improvements from doing this. Does BooleanQuery.setAllowDocsOutOfOrder() make a difference? Regards, Paul Elschot What is the general problem with your approach? And what do all these boosted term queries represent? Would it perhaps be possible for you to add the boost at index time instead of at query time? karl
Re: theoretical maximum score
On Saturday 17 May 2008 00:04:31 Chris Hostetter wrote: : Is it possible to compute a theoretical maximum score for a given : query if constraints are placed on 'tf' and 'lengthNorm'? If so, : scores could be compared to a 'perfect score' (a feature request : from our customers) I think a theoretical maximum score is only going to work when that maximum applies to queries of any structure. So, start with the simplest query, associate it with a theoretical maximum score, and then for each possible combination of subqueries ((weighted) and/or/phrase/span) make sure that the subscore values are combined into another value that has the same theoretical maximum. Have a look here to start: https://issues.apache.org/jira/browse/LUCENE-293 Regards, Paul Elschot
Re: multi word synonyms
On Saturday 17 May 2008 20:28:40 Karl Wettin wrote: As far as I know Lucene only handles single word synonyms at index time. My life would be much simpler if it was possible to add synonyms that spanned over multiple tokens, such as lucene in action=lia. I have a couple of workarounds that are OK but it really isn't the same thing when it comes down to the scoring. The thing that does the best job at scoring was to assemble several permutations of the same document. But it doesn't feel good. I have cases where that means several hundred documents, and I have to do post processing to filter out the duplicate hits. It can turn out to be rather expensive. And I'm sure it messes with the scoring in several ways I did not notice yet. I've also considered creating some multi dimensional term position space, but I'd say that could take a lot of time to implement. Are there any good solutions to this? The simplest solution is to index such synonyms at the first or last or middle position of the source tokens, using a zero position increment for the synonym. Was this one of the workarounds? The advantage of the zero position increment is that the original token positions are not affected, so at least there is no influence on scoring because of changes in the original token positions. Regards, Paul Elschot
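In a custom token filter, the zero position increment trick amounts to a single call on the synonym token. A minimal sketch against the 2.x Token API, where `orig` is the current token and `synonymText` is a hypothetical synonym string:

```java
// Sketch: emit the synonym at the same position as the original token,
// so the original token positions (and scoring) are unaffected.
Token syn = new Token(synonymText, orig.startOffset(), orig.endOffset());
syn.setPositionIncrement(0); // zero increment: same position as orig
```

The filter would return `orig` first and `syn` on the following call to next().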
Re: Filtering a SpanQuery
On Wednesday 07 May 2008 10:18:38 Eran Sevi wrote: Thanks Paul for your reply, Since my index contains a couple of million documents and the filter is supposed to limit the search space to a few thousand I was hoping I won't have to do the filtering myself after running the query on all the index. The code I gave earlier effectively does a filtered query search on the index. It visits the resulting Spans, and does not provide a score value per document as SpanScorer would do. Please make sure to test that code thoroughly for reliable results. Maybe this is the case anyway and behind the scenes the filter does exactly what you suggested. Yes, a filtered query search would use skipTo() on the Spans via SpanScorer. But the difference between the normal case and your case is that you don't need SpanScorer. From what I tested the number of results of the SpanQuery greatly affects the running speed so if I'm going to use about 0.1% of the results I'm losing a lot of time and memory for gathering and storing the spans I'm not going to use. I don't know how SpanQuery works internally but I guess that if the filter is known beforehand, A Filter needs to make a BitSet available before the query search. it could speed things up quite a bit. I would expect a substantial speedup from using skipTo() on the Spans when only 0.1% of the results passes the filter. Regards, Paul Elschot Eran. On Wed, May 7, 2008 at 10:34 AM, Paul Elschot [EMAIL PROTECTED] wrote: On Tuesday 06 May 2008 17:39:38 Paul Elschot wrote: Eran, On Tuesday 06 May 2008 10:15:10 Eran Sevi wrote: Hi, I am looking for a way to filter a SpanQuery according to some other query (on another field from the one used for the SpanQuery). I need to get access to the spans themselves of course. I don't care about the scoring of the filter results and just need the positions of hits found in the documents that match the filter. I think you'll have to implement the filtering on the Spans yourself.
That's not really difficult, just use Spans.skipTo(). The code to do that could look something like this (untested):

  Spans spans = yourSpanQuery.getSpans(reader);
  BitSet bits = yourFilter.bits(reader);
  int filterDoc = bits.nextSetBit(0);
  while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
    boolean more = true;
    while (more && (spans.doc() == filterDoc)) {
      // use spans.start() and spans.end() here
      // ...
      more = spans.next();
    }
    if (! more) { break; }
    filterDoc = bits.nextSetBit(spans.doc());
  }

At this point, no skipping on the spans should be done when filterDoc equals spans.doc(), so this code still needs some work. But I think you get the idea. Regards, Paul Elschot Please check the javadocs of java.util.BitSet, there may be an off-by-one error in the arguments to nextSetBit(). Regards, Paul Elschot I tried looking through the archives and found some reference to a SpanQueryFilter patch, however I don't see how it can help me achieve what I want to do. This class receives only one query parameter (which I guess is the actual query) and not a query and a filter for example. Any help about how I can achieve this will be appreciated. Thanks, Eran.
Re: Filtering a SpanQuery
Eran, On Tuesday 06 May 2008 10:15:10 Eran Sevi wrote: Hi, I am looking for a way to filter a SpanQuery according to some other query (on another field from the one used for the SpanQuery). I need to get access to the spans themselves of course. I don't care about the scoring of the filter results and just need the positions of hits found in the documents that match the filter. I think you'll have to implement the filtering on the Spans yourself. That's not really difficult, just use Spans.skipTo(). The code to do that could look something like this (untested):

  Spans spans = yourSpanQuery.getSpans(reader);
  BitSet bits = yourFilter.bits(reader);
  int filterDoc = bits.nextSetBit(0);
  while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
    boolean more = true;
    while (more && (spans.doc() == filterDoc)) {
      // use spans.start() and spans.end() here
      // ...
      more = spans.next();
    }
    if (! more) { break; }
    filterDoc = bits.nextSetBit(spans.doc());
  }

Please check the javadocs of java.util.BitSet, there may be an off-by-one error in the arguments to nextSetBit(). Regards, Paul Elschot I tried looking through the archives and found some reference to a SpanQueryFilter patch, however I don't see how it can help me achieve what I want to do. This class receives only one query parameter (which I guess is the actual query) and not a query and a filter for example. Any help about how I can achieve this will be appreciated. Thanks, Eran.
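To make the control flow concrete, here is a self-contained sketch with a minimal stand-in for the Spans interface (the real one lives in org.apache.lucene.search.spans; this mock is not Lucene code). Note one deliberate change from the snippet in the thread: the inner loop tests the filter bit directly instead of comparing against filterDoc, which avoids the extra-skip problem Paul mentions and also handles consecutive filtered documents:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SpanFilterDemo {

    /** Minimal stand-in for org.apache.lucene.search.spans.Spans. */
    interface Spans {
        boolean next();              // advance to the next span
        boolean skipTo(int target);  // advance to the first span with doc() >= target
        int doc();
        int start();
        int end();
    }

    /** A Spans over a fixed array of {doc, start, end} triples, sorted by doc. */
    static class ListSpans implements Spans {
        private final int[][] spans;
        private int i = -1;
        ListSpans(int[][] spans) { this.spans = spans; }
        public boolean next() { return ++i < spans.length; }
        public boolean skipTo(int target) {
            do { if (!next()) return false; } while (doc() < target);
            return true;
        }
        public int doc()   { return spans[i][0]; }
        public int start() { return spans[i][1]; }
        public int end()   { return spans[i][2]; }
    }

    /** Visit only the spans whose document passes the filter. */
    static List<int[]> filteredSpans(Spans spans, BitSet bits) {
        List<int[]> result = new ArrayList<int[]>();
        int filterDoc = bits.nextSetBit(0);
        while (filterDoc >= 0 && spans.skipTo(filterDoc)) {
            boolean more = true;
            // Test the bit directly, so no extra skipTo() happens while
            // the current document still passes the filter.
            while (more && bits.get(spans.doc())) {
                result.add(new int[] { spans.doc(), spans.start(), spans.end() });
                more = spans.next();
            }
            if (!more) break;
            filterDoc = bits.nextSetBit(spans.doc());
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] data = { {1, 0, 2}, {3, 1, 4}, {3, 5, 7}, {4, 0, 1}, {7, 2, 3} };
        BitSet bits = new BitSet();
        bits.set(3); bits.set(4); bits.set(6);
        for (int[] s : filteredSpans(new ListSpans(data), bits)) {
            System.out.println("doc=" + s[0] + " start=" + s[1] + " end=" + s[2]);
        }
    }
}
```

Running main visits only the two spans in doc 3 and the one in doc 4; docs 1 and 7 are skipped by the filter.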
Re: Lucene Proximity Searches
Ana, On Friday 18 April 2008 12:41:38 Ana Rabade wrote: I am using ngrams and I need to force that a group of them are together, but if any of them fails, I need the document to be scored as well. Perhaps you could help me to find the solution or give me a reference for the changes I must make. I am using SpanNearQuery, because the ngrams must be in order. Thanks for your answer. - Ana Maria Freire Veiga - Assuming that K terms are involved and K-1 of them need to match in order as ngrams, there are the following options:
- create K SpanNearQuery's on K-1 ordered terms with appropriate slop, add these to a BooleanQuery using Occur.SHOULD, and search this BooleanQuery.
- starting from the same K SpanNearQuery's on K-1 terms, search each of these separately and use your own HitCollector to combine the scores.
For these two options, one could also use the K terms SpanNearQuery to influence the scoring somewhat. The problem with these options is that the number of terms in the query is quadratic in K, possibly giving performance problems for higher values of K. In that case, try the third option:
- modify the code of the NearSpansOrdered class in the org.apache.lucene.search.spans package to allow a match for fewer than all subqueries. This is not going to be straightforward, but it is possible.
In case you choose this last option, please continue on the java-dev list. Regards, Paul Elschot On Fri, Apr 4, 2008 at 12:38 PM, Ana Rabade [EMAIL PROTECTED] wrote: I am using ngrams and I need to force that a group of them are together, but if any of them fails, I need the document to be scored as well. Perhaps you could help me to find the solution or give me a reference for the changes I must make. I am using SpanNearQuery, because the ngrams must be in order. Thanks for your answer. - Ana Maria Freire Veiga - On Thu, Apr 3, 2008 at 7:56 PM, Erick Erickson [EMAIL PROTECTED] wrote: Could you explain your use case?
Because to say that you want to score documents that don't have all the terms with a *phrase query* is contradictory. The point of a phrase query is exactly that all the terms are there and within some proximity. Best Erick On Thu, Apr 3, 2008 at 12:17 PM, Ana Rábade [EMAIL PROTECTED] wrote: Hi! I'm using Lucene Proximity Searches, but I've seen Lucene only scores documents which contain all the terms in the phrase. I also need to score documents although they don't contain all those terms. Is it possible with Lucene PhraseQueries or SpanNearQuery? If not, could you tell me a way to find my solution? Thank you very much. - Ana M. Freire -
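Paul's first option can be sketched as follows with the spans API (untested; the field name "f", the `terms` array, and `slop` are placeholders):

```java
// Sketch: K ordered SpanNearQuery's, each on K-1 terms (one term left out),
// combined as optional clauses so a near-miss still matches and scores.
BooleanQuery combined = new BooleanQuery();
for (int leaveOut = 0; leaveOut < terms.length; leaveOut++) {
    SpanQuery[] clauses = new SpanQuery[terms.length - 1];
    for (int i = 0, j = 0; i < terms.length; i++) {
        if (i != leaveOut) {
            clauses[j++] = new SpanTermQuery(new Term("f", terms[i]));
        }
    }
    combined.add(new SpanNearQuery(clauses, slop, true), BooleanClause.Occur.SHOULD);
}
```

As noted above, this builds K queries of K-1 terms each, so the total term count grows quadratically in K.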
Re: QueryWrapperFilter question...
On Thursday 17 April 2008 06:37:18 Michael Stoppelman wrote: Actually, I screwed up the timing info. I wasn't including the time for the QueryWrapperFilter#bits(IndexReader) call. Sadly, it actually takes longer than the original query that had both terms included. Bummer. I had really convinced myself till the thought came to me at lunch :). For a single query, adding a filter of course has a cost. But when the location part can be reused in later queries, give CachingWrapperFilter a try. Regards, Paul Elschot -M On Wed, Apr 16, 2008 at 6:43 PM, Karl Wettin [EMAIL PROTECTED] wrote: Michael Stoppelman wrote: Hi all, I've been doing some performance testing and found that using a QueryWrapperFilter for a location field restriction I have to do allows my search results to approach 5-10ms. This was surprising. Before, the performance was between 50ms-100ms. The queries from before the optimization look like the following: +(+(text:cats) +(loc:1 loc:2 loc:3 ...)) The QueryWrapperFilter does do a search itself. Why would performance be so drastically different when the QueryWrapperFilter needs to do a search? Does Lucene just not have the statistics to optimize this query so it can decide which terms to filter by first? Do you wonder why a QueryWrapperFilter is faster than a Query? Then the answer is that the filter uses a bitset to know whether a document matches or not. For each document that matches text:cats it checks the flag in the bitset for that document number instead of seeking in the index to find out if it also matches loc:1, loc:2 or loc:3. karl
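When the location restriction recurs across queries, the caching variant Paul suggests looks roughly like this (2.x API sketch; `locationQuery`, `textQuery` and `searcher` are placeholders):

```java
// Sketch: compute the location bits once, then reuse them for later
// searches against the same IndexReader.
Filter cached = new CachingWrapperFilter(new QueryWrapperFilter(locationQuery));
Hits hits = searcher.search(textQuery, cached);
```

The first search pays the bits() cost; subsequent searches with the same filter and reader reuse the cached bitset.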
Re: Using Lucene partly as DB and 'joining' search results.
On Saturday 12 April 2008 00:03:13 Antony Bowesman wrote: Paul Elschot wrote: On Friday 11 April 2008 13:49:59 Mathieu Lecarme wrote: Use Filter and BitSet. From the personal data, you build a Filter (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html) which is used in the main index. With 1 billion mails, and possibly a Filter per user, you may want to use more compact filters than BitSets, which is currently possible in the development trunk of Lucene. Thanks for the pointers. I've already used Solr's DocSet interface in my implementation, which I think is where the ideas for the current Lucene enhancements came from. The ideas came from quite a few sources. They can be traced starting from changes.txt in the sources. They work well to reduce the filter's footprint. I'm also caching filters. The intention is that there is a user data index and the mail index(es). The search against the user data index will return a set of mail Ids, which is the common key between the two. Doc Ids are no good between the indexes, so that means a potentially large boolean OR query to create the filter of labelled mails in the mail indexes. I know it's a theoretical question, but will this perform? The normal way to collect doc ids for a filter is into a bitset, iterating over the indexed ids (mail ids in your case). A bitset has random access, so there is no need to do this in doc id order. An OR query has to work in doc id order so it can compute a score per doc id, and the ordering loses some performance. Regards, Paul Elschot
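Collecting the filter bits by iterating over the indexed mail ids, rather than through an OR query, could look something like this (2.x API sketch, untested; the "mailId" field name and the `mailIds` collection are placeholders):

```java
// Sketch: set a bit per matching document, in whatever order the ids
// arrive; the bitset's random access means no doc-id ordering is needed.
BitSet bits = new BitSet(reader.maxDoc());
TermDocs termDocs = reader.termDocs();
for (Iterator it = mailIds.iterator(); it.hasNext();) {
    termDocs.seek(new Term("mailId", (String) it.next()));
    while (termDocs.next()) {
        bits.set(termDocs.doc());
    }
}
termDocs.close();
```

This skips the per-document score computation an OR query would perform.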
Re: Using Lucene partly as DB and 'joining' search results.
On Friday 11 April 2008 13:49:59 Mathieu Lecarme wrote: Antony Bowesman wrote: We're planning to archive email over many years and have been looking at using a DB to store mail meta data and Lucene for the indexed mail data, or just Lucene on its own with email data and structure stored as XML and the raw message stored in the file system. For some customers, the volumes are likely to be well over 1 billion mails over 10 years, so some partitioning of data is needed. At the moment the thoughts are moving away from using a DB + Lucene to just Lucene along with a file system representation of the complete message. All searches will be against the index, then the XML mail meta data is loaded from the file system. The archive is read only apart from bulk deletes, but one of the requirements is for users to be able to label their own mail. Given that a Lucene Document cannot be updated, I have thought about having a separate Lucene index that has just the 3 terms (or some combination of) userId + mailId + label. That of course would mean joining searches from the main mail data index and the label index. Does anyone have any experience of using Lucene this way and is it a realistic option of avoiding the DB at all? I'd rather have the headache of scaling just Lucene, which is a simple beast, than the whole bundle of 'stuff' that comes with the database as well. Use Filter and BitSet. From the personal data, you build a Filter (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html) which is used in the main index. With 1 billion mails, and possibly a Filter per user, you may want to use more compact filters than BitSets, which is currently possible in the development trunk of Lucene. Regards, Paul Elschot
Re: Why Lucene has to rewrite queries prior to actual searching?
On Tuesday 08 April 2008 15:18:34 Itamar Syn-Hershko wrote: Paul, I don't see how this answers the question. Towards the end, the page describes when a Scorer is called and roughly what it does. I was asking why Lucene has to access the index with exact terms, and not use RegEx or simpler wildcard support internally? If Lucene were able to look for w?rd or wor* and treat the wildcards as wildcards, this would greatly improve the speed of searches and eliminate the need for Query rewriting. When it is known in advance that w?rd and wor* will be used in queries a lot, one can write a tokenizer that indexes them so that they can be searched directly. The problem is to know that in advance, that is at indexing time. Since some people may want to index chars like those used in wildcards, these could be escaped (or those people will use the standard search classes available today instead). I'm not entirely sure what part of Lucene does the actual access to the terms and position vectors, but if it could be subclassed or cloned, and then modified to honor wildcards or even RegEx, that would bring Lucene to new heights. There are regular expression queries in the regex contrib module, however these work by rewriting to actually indexed terms. Unless, again, there is a specific reason why this can't be done. There is no specific reason why it cannot be done, one only needs to provide the corresponding tokenizer to be used at indexing time. Kind regards, Paul Elschot Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 1:56 AM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? Itamar, Have a look here: http://lucene.apache.org/java/2_3_1/scoring.html Regards, Paul Elschot On Tuesday 08 April 2008 00:34:48 Itamar Syn-Hershko wrote: Paul and John, Thanks for your quick reply. The problem with query rewriting is the aforementioned MaxClauseException.
Instead of inflating the query and passing a deterministic list of terms to the actual search routine, Lucene could have accessed the vectors in the index using some sort of filter. So, for example, if it knows to access Foobar by its name in the index, why can't it take Foo* and just get all the vectors until Fop is met (for example)? Why does it have to get a deterministic list of terms? I will take a look at the Scorer - can you describe in short what exactly it does and where and when it is being called? I don't get John's comment though - Query::rewrite is being called prior to the actual searching (through QueryParser), how come it can use information gathered from the IndexReader at search time? Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 12:57 AM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? Itamar, Query rewrite replaces wildcards with terms available from the index. Usually that involves replacing a wildcard with a BooleanQuery that is an effective OR over the available terms while using a flat coordination factor, i.e. it does not matter how many of the available terms actually match a document, as long as at least one matches. For the required query parts (AND-like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer. Regards, Paul Elschot On Monday 07 April 2008 23:13:09 Itamar Syn-Hershko wrote: Hi all, Can someone from the experts here explain why Lucene has to get a rewritten query for the Searcher - so Phrase or Wildcard queries have to rewrite themselves into a primitive query, that is then passed to Lucene to look for? I'm probably not too familiar with the internals of Lucene, but I'd imagine that if you can inflate a query using wildcards via Query subclassing, you could as easily (?)
have some sort of Filter mechanism during the search, so that Lucene retrieves the position vectors for all the terms that pass that filter, instead of retrieving only the position data for deterministic terms (with no wildcards etc.). If that were possible to do somehow, it could greatly increase the searchability of Lucene indices by using RegEx (without re-writing and getting the dreaded MaxClauseCount error) and similar. Would love to hear some insights on this one. Itamar.
Re: Why Lucene has to rewrite queries prior to actual searching?
Itamar, Query rewrite replaces wildcards with terms available from the index. Usually that involves replacing a wildcard with a BooleanQuery that is an effective OR over the available terms while using a flat coordination factor, i.e. it does not matter how many of the available terms actually match a document, as long as at least one matches. For the required query parts (AND-like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer. Regards, Paul Elschot On Monday 07 April 2008 23:13:09 Itamar Syn-Hershko wrote: Hi all, Can someone from the experts here explain why Lucene has to get a rewritten query for the Searcher - so Phrase or Wildcard queries have to rewrite themselves into a primitive query, that is then passed to Lucene to look for? I'm probably not too familiar with the internals of Lucene, but I'd imagine that if you can inflate a query using wildcards via Query subclassing, you could as easily (?) have some sort of Filter mechanism during the search, so that Lucene retrieves the position vectors for all the terms that pass that filter, instead of retrieving only the position data for deterministic terms (with no wildcards etc.). If that were possible to do somehow, it could greatly increase the searchability of Lucene indices by using RegEx (without re-writing and getting the dreaded MaxClauseCount error) and similar. Would love to hear some insights on this one. Itamar.
Re: Why Lucene has to rewrite queries prior to actual searching?
Itamar, Have a look here: http://lucene.apache.org/java/2_3_1/scoring.html Regards, Paul Elschot On Tuesday 08 April 2008 00:34:48 Itamar Syn-Hershko wrote: Paul and John, Thanks for your quick reply. The problem with query rewriting is the aforementioned MaxClauseException. Instead of inflating the query and passing a deterministic list of terms to the actual search routine, Lucene could have accessed the vectors in the index using some sort of filter. So, for example, if it knows to access Foobar by its name in the index, why can't it take Foo* and just get all the vectors until Fop is met (for example)? Why does it have to get a deterministic list of terms? I will take a look at the Scorer - can you describe in short what exactly it does and where and when it is being called? I don't get John's comment though - Query::rewrite is being called prior to the actual searching (through QueryParser), how come it can use information gathered from the IndexReader at search time? Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 12:57 AM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? Itamar, Query rewrite replaces wildcards with terms available from the index. Usually that involves replacing a wildcard with a BooleanQuery that is an effective OR over the available terms while using a flat coordination factor, i.e. it does not matter how many of the available terms actually match a document, as long as at least one matches. For the required query parts (AND-like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer.
Re: Improving Index Search Performance
Since you're using all the results for a query, and ignoring the score value, you might try to do the same thing with a relational database. But I would not expect that to be much faster, especially when using a field cache. Other than that, you could also go the other way, and try to add more data to the Lucene index that can be used to reduce the number of results to be fetched. Regards, Paul Elschot On Wednesday 26 March 2008 13:51:24, Shailendra Mudgal wrote: The bottom line is that reading fields from docs is expensive. FieldCache will, I believe, load fields for all documents but only once - so the second and subsequent times it will be fast. Even without using a cache it is likely that things will speed up because of caching by the OS. As I mentioned in my previous mail, the companyId is a multivalued field, so caching it will consume a lot of memory. And this way we'll have to keep the document vs field mapping also in memory. If you've got plenty of memory vs index size you could look at RAMDirectory or MMapDirectory. Or how about some solid state disks? Someone recently posted some very impressive performance stats. The index size is around 20G and the available memory is 4G, so keeping the entire index in memory is not possible. But as I mentioned earlier, it is using only 1G out of 4G, so is there a way to tell Lucene to cache more documents, say use 2G for caching the index? I'll appreciate more suggestions on the same problem. Regards, Vipin
Re: Improving Index Search Performance
Shailendra, Have a look at the javadocs of HitCollector: http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/search/HitCollector.html The problem is with the use of the disk head: when retrieving the documents during collecting, the disk head has to move between the inverted index and the stored documents; see also the file formats. To avoid such excessive disk head movement, you need to collect all (or at least many more than 1 of) your document ids during collect(), for example into an int[]. After collecting, retrieve all the docs with Searcher.doc(). Also, for the same reason, retrieving docs is best done in doc id order, but that is unlikely to go wrong as doc ids are normally collected in increasing order. Regards, Paul Elschot On Tuesday 25 March 2008 13:43:18, Shailendra Mudgal wrote: Hi Everyone, We are using Lucene to search on an index of around 20G size with around 3 million documents. We are facing performance issues loading large results from the index. Based on the various posts on the forum and documentation, we have made the following code changes to improve the performance: i. Modified the code to use HitCollector instead of Hits since we will be loading all the documents in the index based on keyword matching ii. Added MapFieldSelector to load only selected fields (2 fields only) instead of all the 14 After all these changes, it seems to be taking around 90 secs to load 17k documents. After profiling, we found that the max time is spent in searcher.doc(id, selector). Here is the code: public void collect(int id, float score) { try { MapFieldSelector selector = new MapFieldSelector(new String[] {COMPANY_ID, ID}); doc = searcher.doc(id, selector); mappedCompanies = doc.getValues(COMPANY_ID); } catch (IOException e) { logger.debug("inside IDCollector.collect(): " + e.getMessage()); } } We also read in one of the posts that we should use bitSet.set(doc) instead of calling searcher.doc(id).
But we are unable to understand how this might help in our case since we will anyway have to load the document to get the other required field (company_id). Also we observed that the searcher is actually using only 1G RAM though we have 4G allocated to it. Can someone suggest if there is any other optimization that can be done to improve the search performance on MultiSearcher? Any help would be appreciated. Thanks, Vipin
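Paul's advice above, collect only doc ids during collect() and fetch stored fields afterwards in increasing doc-id order, can be sketched without Lucene as follows (the class and method names are illustrative, not Lucene's API):

```java
import java.util.*;

// Illustrative sketch (not Lucene's API): gather doc ids during hit
// collection into a growable int[], then fetch stored documents
// afterwards in increasing doc-id order, so the disk head does not
// bounce between the inverted index and the stored fields.
public class IdCollectorSketch {
    private int[] ids = new int[4];
    private int size = 0;

    // Called once per matching document; cheap, just records the id.
    public void collect(int docId) {
        if (size == ids.length) ids = Arrays.copyOf(ids, ids.length * 2);
        ids[size++] = docId;
    }

    // After collecting, return the ids sorted for sequential retrieval.
    public int[] sortedIds() {
        int[] result = Arrays.copyOf(ids, size);
        Arrays.sort(result); // normally already increasing, but be safe
        return result;
    }

    public static void main(String[] args) {
        IdCollectorSketch c = new IdCollectorSketch();
        for (int id : new int[] {7, 3, 11, 5, 2}) c.collect(id);
        // Retrieval would then loop over these ids calling Searcher.doc().
        System.out.println(Arrays.toString(c.sortedIds()));
    }
}
```

The point is that collect() stays trivial and all stored-field I/O happens in one ordered pass after the search.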
Re: Call Lucene default command line Search from PHP script
Milu, This is a PHP problem, not a Lucene one, so you might get a better response on a PHP mailing list. The easy way around your problem is probably by invoking a shell script from PHP that exports the class path as you indicated, so that java can see the correct classes. Having said that, you'll probably want to use the PHP/Java extension to avoid initializing a JVM for each call to Lucene. Try this: http://www.google.nl/search?q=php+java+org+apache+lucene&ie=UTF-8&oe=UTF-8 This was one of the results: http://www.idimmu.net/index.php?blog%5Bpagenum%5D=3 Regards, Paul Elschot On Friday 21 March 2008 21:24:37, milu07 wrote: Hello, My machine is Ubuntu 7.10. I am working with Apache Lucene. I am done with the indexer and tried the command line Searcher (the default command line included in the Lucene package: http://lucene.apache.org/java/2_3_1/demo2.html). When I use this at the command line: java Searcher -query algorithm it works and returns a list of results to me. Here 'algorithm' is the keyword to search. However, I want to have a web search interface written in PHP, so I use PHP exec() to call this Searcher from my PHP script: exec("java Searcher -query algorithm", $arr, $retVal); [I also tried: exec("java Searcher -query 'algorithm'", $arr, $retVal)] It does not work. I print the value of $retVal, it is 1. I come back and try: exec("java Searcher -query algorithm 2>&1", $arr, $retVal); I receive: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer and $retVal is 1. In the command line Searcher.java of Lucene, it imports many libraries; is this the problem? import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; I guess this is a path problem. However, I do not know how to fix it because it works on the command line ($CLASSPATH points to the .jar file of the Lucene library). Maybe PHP does not know $CLASSPATH.
So, I add the Lucene lib to $PATH: export PATH=$PATH:/usr/lib/lucene-core-2.3.1.jar:/usr/lib However, I get the same error message when I try: exec("java Searcher -query algorithm 2>&1", $arr, $retVal); Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer Could you please help? Thank you,
Re: Call Lucene default command line Search from PHP script
On Saturday 22 March 2008 00:32:32, Paul Elschot wrote: Milu, This is a PHP problem, not a Lucene one, so you might get a better response on a PHP mailing list. The easy way around your problem is probably by invoking a shell script from PHP that exports the class path as you indicated, so that java can see the correct classes. I meant a shell script that exports the class path, and invokes java from the same shell. Regards, Paul Elschot
Re: HELP: how to list term score inside some document?
On Friday 14 March 2008 17:28:17, Rao WeiXiong wrote: Dear: Is it possible to list all term scores inside some document by some simple method? Now I just use each term as the query to search the whole index to get the score. This seems very cumbersome; is there any simpler approach? Have a look at Searcher.explain() Regards, Paul Elschot
Re: MultiFieldQueryParser - BooleanClause.Occur
On Friday 29 February 2008 18:04:47, Donna L Gresh wrote: I believe something like the following will do what you want: QueryParser parserTitle = new QueryParser("title", analyzer); QueryParser parserAuthor = new QueryParser("author", analyzer); BooleanQuery overallquery = new BooleanQuery(); BooleanQuery firstQuery = new BooleanQuery(); Query q1 = parserTitle.parse(term1); Query q2 = parserAuthor.parse(term1); firstQuery.add(q1, BooleanClause.Occur.SHOULD); // SHOULD is like an OR firstQuery.add(q2, BooleanClause.Occur.SHOULD); BooleanQuery secondQuery = new BooleanQuery(); Query q3 = parserTitle.parse(term2); Query q4 = parserAuthor.parse(term2); secondQuery.add(q3, BooleanClause.Occur.SHOULD); secondQuery.add(q4, BooleanClause.Occur.SHOULD); overallquery.add(firstQuery, BooleanClause.Occur.MUST); // MUST is like an AND overallquery.add(secondQuery, BooleanClause.Occur.MUST); There is no need for a QueryParser in this case when using a TermQuery instead of a Query for q1, q2, q3 and q4: TermQuery q1 = new TermQuery(new Term("title", term1)); Regards, Paul Elschot Donna Gresh JensBurkhardt [EMAIL PROTECTED] wrote on 02/29/2008 10:46:51 AM: Hey everybody, I read that it's possible to generate a query like: (title:term1 OR author:term1) AND (title:term2 OR author:term2) and so on. I also read that BooleanClause.Occur should help me handle this problem. But I have to admit that I totally don't understand how to use it. If someone can explain or has a link to an explanation, this would be terrific. Thanks and best regards Jens Burkhardt -- View this message in context: http://www.nabble.com/MultiFieldQueryParser---BooleanClause.Occur-tp15761243p15761243.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: How to pass additional information into Similarity.scorePayload(...)
Hi Cedric, I think I'm beginning to get the point of the [10/5/2], and why you called that requirement a bit strange, see below. To use both normal position info and paragraph position info you'll need two separate fields, one normal, and one for the paragraphs. To use the normal field to determine the matches, and the paragraph field to determine the weightings of these matches, the TermPositions of both fields will have to be advanced completely in sync. That is possible, but not really nice to do. If Lucene had multiple positions for an indexed term, it would be straightforward. But as long as that is not the case, you'll either have to advance the two TermPositions in sync, or use payloads with the paragraph numbers. Or you could relax the paragraph numbering requirement into a positional requirement, and use the modified SpanFirstQuery. That could be done by using an average paragraph length to determine the weight at the matching position. As this is easy to implement, I'd first implement this and try to sell it to the users :) At that marketing moment you might as well ask the users what they think of matches that cross paragraph borders. Do you already have a firm requirement for that case? SpanNotQuery can be used to prevent matches over paragraph borders when these are indexed as such, but I would not expect that you would need those, given the fuzziness of the [10/5/2]. Regards, Paul Elschot On Friday 15 February 2008 09:45:58, Cedric Ho wrote: Hi Paul, Do you mean the following? e.g. to index this: first second third paragraphBorder forth fifth six originally it would be indexed as: (first,0) (second,1) (third,2) (forth,3) (fifth,4) (six,5) now it will be: (first,0) (second,0) (third,0) (forth,1) (fifth,1) (six,1) Then those Query classes that depend on the positional information (PhraseQuery, SpanQueries) won't work then? Unfortunately I'll need those Query classes as well.
Cedric
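The scheme discussed in this thread, an extra field in which each token's position equals its paragraph number, obtained by only allowing a position increment at a paragraph border, can be sketched in plain Java (this is an illustration of the position bookkeeping, not Lucene's TokenStream API):

```java
import java.util.*;

// Illustrative sketch: assign each token a position equal to its
// paragraph number, by using a position increment of 1 only at a
// paragraph border and 0 everywhere else. "PARA" marks a border.
public class ParagraphPositions {
    static Map<String, List<Integer>> positions(List<String> tokens) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        int position = 0; // paragraph number, starting at 0
        for (String token : tokens) {
            if (token.equals("PARA")) { position++; continue; } // increment only here
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "first", "second", "third", "PARA", "forth", "fifth", "six");
        // first, second, third land at paragraph 0; forth, fifth, six at 1,
        // matching the (first,0) ... (six,1) example from this thread.
        System.out.println(positions(tokens));
    }
}
```

In Lucene terms, the paragraph field's tokens would carry these positions while the normal field keeps ordinary incrementing positions for PhraseQuery and SpanQuery use.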
Re: How to pass additional information into Similarity.scorePayload(...)
I have no idea what the [10/5/2] means, so I can't comment on that. In case I have missed it previously I'm sorry. My point was that payloads need not be used for different position info. It's possible to do that, and it may be good for performance in some cases, but one can revert to using another field for different position info. Regards, Paul Elschot On Thursday 14 February 2008 09:44:40, Cedric Ho wrote: Hi Paul, Sorry, I am not sure I understand your solution, because I would need to apply this scoring logic to all the different types of Queries. A search may consist of something like: +(term1 phrase2 wildcard*) +spanNear(term3 term4) [10/5/2] And this [10/5/2] ratio has to be applied to the whole search query before it. So I am not sure how using just SpanFirstQuery with a separate field would work in this situation. Anyway, I know my requirement is a bit strange, so it's ok if I can't do this in Lucene. I'll settle for using a ThreadLocal to store the [10/5/2] weighting and retrieve it in the Similarity.scorePayload(...) function. BTW, this problem I am facing now is different from the last one I asked here, for which you proposed the modified SpanFirstQuery solution =) But I am really grateful for all the help I get here. Keep up the good work! Cheers, Cedric On Thu, Feb 14, 2008 at 2:58 PM, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 14 February 2008 02:11:24, Cedric Ho wrote: I am using Lucene's built-in query classes: TermQuery, PhraseQuery, WildcardQuery, BooleanQuery and many of the SpanQueries. The info I am going to pass in is just some weightings for different parts of the indexed contents. For example if the payload indicates that a term is in the 2nd paragraph, then I'll take the weighting for the 2nd paragraph and multiply it by the score. So it seems without writing my own query there's no way to do it? In case it is only positional information that is stored in the payload (i.e.
some integer number that does not decrease when tokenizing the document), it is also possible to use an extra field and make sure the position increment for that field is only positive when the number (currently your payload) increases. A SpanFirstQuery on this extra field would almost do, and you will probably need https://issues.apache.org/jira/browse/LUCENE-1093 . This will be somewhat slower than using a payload, because the search will be done in two separate fields, but it will work. Regards, Paul Elschot
Re: How to pass additional information into Similarity.scorePayload(...)
On Friday 15 February 2008 02:47:14, Cedric Ho wrote: Sorry that I didn't make myself clear. [10/5/2] means for terms found in the 1st paragraph, give it score*10, for terms in the 2nd, give it score*5, etc. So I don't know how to do this scoring if the position (paragraph) information is in a separate field. For each word in the input stream make sure that the position at which it is indexed in an extra field is the same as the paragraph number. That will involve only allowing a position increment at a paragraph border during indexing. Call this extra field the paragraph field if you will. Then, during search, search for a Term in the paragraph field, and use the position from that field, i.e. the paragraph number, to find a weight for the found term. Have a look at PhraseQuery on how to use term positions during search. It computes relative positions, but it works on the absolute positions that it gets from the index. SpanFirstQuery also allows to do that; it's a bit more involved, but in the end it works from the same absolute positions from the index. The version at the jira issue will even allow the use of the length of the matching spans as the absolute paragraph number, which, in turn, allows the use of a Similarity for the paragraph weights [10/5/2]. There is nothing special about indexed term positions; any term can be indexed at any position in a field. Lucene will take advantage of the incremental nature of positions by storing only compressed differences of positions in the index, but during search the original positions are directly available. You can do the same with payloads, but why reimplement something that is already available? Payloads have better uses than positional info, for one they are great to avoid disjunctions. For example for verbs, one could index only the stem and use a payload for the actual inflected form (singular/plural, past/present, first/second/third person, etc).
Regards, Paul Elschot
Re: How to pass additional information into Similarity.scorePayload(...)
On Thursday 14 February 2008 02:11:24, Cedric Ho wrote: I am using Lucene's built-in query classes: TermQuery, PhraseQuery, WildcardQuery, BooleanQuery and many of the SpanQueries. The info I am going to pass in is just some weightings for different parts of the indexed contents. For example if the payload indicates that a term is in the 2nd paragraph, then I'll take the weighting for the 2nd paragraph and multiply it by the score. So it seems without writing my own query there's no way to do it? In case it is only positional information that is stored in the payload (i.e. some integer number that does not decrease when tokenizing the document), it is also possible to use an extra field and make sure the position increment for that field is only positive when the number (currently your payload) increases. A SpanFirstQuery on this extra field would almost do, and you will probably need https://issues.apache.org/jira/browse/LUCENE-1093 . This will be somewhat slower than using a payload, because the search will be done in two separate fields, but it will work. Regards, Paul Elschot
Re: recall/precision with lucene
On Saturday 09 February 2008 01:59:12, Panos Konstantinidis wrote: Hello, I am a new Lucene user. I am trying to calculate the recall/precision of a query and I was wondering if Lucene provides an easy way to do it. Currently I have a number of documents that match a given query. Then I am doing a search and I am getting back all the Hits. I then divide the number of documents that came back from Lucene (the Hits size) by the number of documents that it should have got. This is how I calculate the recall. Since you're going to use all hits for the query, it is normally better to avoid Hits and use a HitCollector or a TopDocs. For precision I just get the hits.score() of each relevant document. I am not sure if I am on the right track or if there is an easier/better way to do it. I would appreciate any insight into this. To use the score value for precision one could define a cut-off value for the score value, but then the calculation for recall would also need to be adapted. For this a HitCollector would be good. In case you want the results sorted by decreasing score value, have a look at the search methods that return TopDocs. From this one can make a precision/recall graph for the query by considering the total results higher than a given score. When a lot of such computations are needed, you may also want to cache the values of a unique identifier field for all indexed docs; have a look at FieldCache for this. Regards, Paul Elschot
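For reference, the standard precision and recall definitions under discussion can be computed from a retrieved set and a relevant set with a few lines of plain Java, independent of Lucene:

```java
import java.util.*;

// Precision = |retrieved AND relevant| / |retrieved|
// Recall    = |retrieved AND relevant| / |relevant|
public class PrecisionRecall {
    static double[] evaluate(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant); // the true positives
        double precision = retrieved.isEmpty() ? 0 : (double) hits.size() / retrieved.size();
        double recall = relevant.isEmpty() ? 0 : (double) hits.size() / relevant.size();
        return new double[] { precision, recall };
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("d2", "d4", "d5"));
        double[] pr = evaluate(retrieved, relevant);
        // 2 of the 4 retrieved are relevant (precision 0.5);
        // 2 of the 3 relevant were retrieved (recall ~0.667).
        System.out.println("precision=" + pr[0] + " recall=" + pr[1]);
    }
}
```

With a score cut-off as Paul suggests, "retrieved" becomes the set of docs scoring above the cut-off, and varying the cut-off traces out the precision/recall graph.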
Re: Lucene syntax query matched against a string content
Without using a RAMDirectory index it would be necessary to implement all Scorers used by the query directly on top of the token stream that normally goes into the index. This is possible, but Lucene is not designed to do this, so it won't be easy. But especially for matching many preparsed queries against a small set of new documents, this might be nice to have. Still, even for that case, it would only gain performance over using RAMDirectory when the queries can be evaluated from the ground up, sharing as many subqueries as possible. And that is just the opposite of the top-down way query search is currently implemented on a prebuilt index. The basic design for this would be to start from a set of queries to be 'analyzed' to make them share as many subqueries as possible, building a query graph. Then this query graph would be fed the new documents one by one, resulting in a score for each matching query that was added to the query graph. It is possible, but it would be quite a bit of work. And then someone will come along with the requirement to match an existing index against such a query graph, which is not a bad idea either, but it might need yet another way of collecting the results. Regards, Paul Elschot On Friday 08 February 2008 05:48:08, Nilesh Bansal wrote: Hi, I want to create a function which takes in a query string (in Lucene syntax) and a string as content, and returns whether the query matches the content or not. This would mean query = +(apache) +(lucene OR httpd) will match content = "HTTPD by Apache foundation is one of the most popular open source projects" and will not match content = "Lucene and httpd are projects from same open source foundation" Basically, I need to fill in the contents of the following Java function. This should be easy to do, but I don't know how. I obviously don't want to create a dummy Lucene index in memory with a single document and then search for the query against that (for performance reasons).
public static boolean isRelevant(String luceneQuery, String contents) { // TODO fill in } Instead of boolean, it could return a relevance score, which will be zero if the query is not relevant to the document. Any help will be appreciated. thanks Nilesh
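The boolean semantics of Nilesh's example, required groups where at least one optional term per group must appear, can be illustrated with a tiny self-contained matcher. This only demonstrates the AND-of-ORs logic; real query parsing and scoring are, as Paul notes, considerably more involved.

```java
import java.util.*;

// Toy matcher for queries of the form +(a) +(b OR c): every required
// group must have at least one of its terms present in the content.
// Illustrates only the boolean semantics, not Lucene's parsing/scoring.
public class ToyBooleanMatch {
    static boolean matches(List<List<String>> requiredGroups, String content) {
        Set<String> tokens = new HashSet<>(
            Arrays.asList(content.toLowerCase().split("\\W+")));
        for (List<String> group : requiredGroups) {
            boolean any = false;
            for (String term : group) {
                if (tokens.contains(term.toLowerCase())) { any = true; break; }
            }
            if (!any) return false; // a required group matched nothing
        }
        return true;
    }

    public static void main(String[] args) {
        // +(apache) +(lucene OR httpd)
        List<List<String>> query = Arrays.asList(
            Arrays.asList("apache"),
            Arrays.asList("lucene", "httpd"));
        System.out.println(matches(query,
            "HTTPD by Apache foundation is one of the most popular open source projects"));
        System.out.println(matches(query,
            "Lucene and httpd are projects from same open source foundation"));
    }
}
```

The first content matches (both required groups are satisfied); the second does not, since "apache" is absent.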
Re: Lucene to index OCR text
On Tuesday 29 January 2008 03:32:08, Daniel Noll wrote: On Friday 25 January 2008 19:26:44, Paul Elschot wrote: There is no way to do exact phrase matching on OCR data, because no correction of OCR data will be perfect. Otherwise the OCR would have made the correction... snip suggestion to use fuzzy query The problem I see with a fuzzy query is that if you have the fuzziness set to 1, then "fat" will match "mat". But in reality, "f" and "m" don't get confused with OCR. What you really want is for a given term to expand to a boolean query of all possible misidentified alternatives. For that you would first need to figure out which characters are often misidentified as others, which can probably be achieved by going over a certain number of documents and manually checking which letters are wrong. This should provide slightly more comprehensive matching without matching terms which are obviously different to the naked eye. It's also possible to select the fuzzy terms by their document frequency, and reject all that have a (quite a bit) higher doc frequency than the given term. Combined with proximity to another similarly queried term this can work reasonably well. For query search performance, selecting only low frequency terms is nice, as it avoids searching for high frequency terms. Btw, this use of a worse spelling is more or less the opposite of suggesting a better spelling from terms with a higher doc frequency. What would be ideal is if an analyser could do this job (a "looks like" analyser, like how SoundEx is a "sounds like" analyser.) But I get the feeling that this would be very difficult. Shame the OCR software can't store this information, e.g. 80% odds that this character is a "t" but 20% odds that it's an "f". If you had that for every character it would be very useful... Ah yes, the ideal world. Is there OCR software that provides such details? Regards, Paul Elschot
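The "looks like" expansion Daniel describes can be sketched as a per-character confusion table applied to a query term. The table entries below are made-up examples for illustration, not measured OCR statistics:

```java
import java.util.*;

// Sketch of "looks like" term expansion: for each character that has
// known OCR confusions, also generate the variant spellings, which
// would then form a boolean OR of alternatives. The confusion table
// here is illustrative, not derived from real OCR error data.
public class OcrVariants {
    static final Map<Character, char[]> CONFUSIONS = new HashMap<>();
    static {
        CONFUSIONS.put('t', new char[] {'f'});      // 't' sometimes read as 'f'
        CONFUSIONS.put('l', new char[] {'1', 'i'}); // 'l' as '1' or 'i'
    }

    // Expand a term into itself plus all single-substitution variants.
    static Set<String> variants(String term) {
        Set<String> result = new LinkedHashSet<>();
        result.add(term);
        for (int i = 0; i < term.length(); i++) {
            char[] subs = CONFUSIONS.get(term.charAt(i));
            if (subs == null) continue;
            for (char s : subs) {
                result.add(term.substring(0, i) + s + term.substring(i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("felt"));
    }
}
```

Per the doc-frequency suggestion above, each generated variant would then be kept only if it actually occurs in the index with a suitably low document frequency.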
Re: Lucene to index OCR text
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote: I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene without any trouble, but OCR errors are a problem, when doing exact phrase matches in particular. I'm looking for ideas on how to deal with this thorny problem. How about letter-by-letter ngrams coupled with SpanQueries (or more likely, a custom query utilizing the TermPositions iterator)? There is no way to do exact phrase matching on OCR data, because no correction of OCR data will be perfect. Otherwise the OCR would have made the correction... What you'll need is something like a fuzzy query as the leaves of a phrase query. Also, there may be missing word boundaries, and in that case you'll have to use a truncation query instead of a phrase query. The more fuzziness introduced in the query, the higher the chance of false matches, so there really is no single answer to this. It depends on how many false matches the users will accept and on how many OCR errors there are. One could start by adding some fuzzy term matching to phrase query, and see what the users think of that. They will lose some performance, and that is another factor in the fuzziness tradeoff. SpanQueries could be used too; for these a fuzzy term match would need to be added, as well as a query parser. When adding fuzzy term matching to a phrase query looks to be a bit daunting, have a look at the surround query parser in the contrib area. It has truncation and proximity based on span queries, but no fuzzy term matching, so it could also be a start for investigating. It all depends on how good the OCR was, but in some cases (think old paper) it's just not possible to do good OCR.
Regards, Paul Elschot
Re: Lucene Performance
On Friday 18 January 2008 17:52:27, Thibaut Britz wrote: Hi, ... Another thing I noticed is that we append a lot of queries, so we have a lot of duplicate phrases like (A and B or C) and ... and (A and B or C) (more nested than that). Is Lucene doing any internal query optimization (like Karnaugh maps) by removing the last (A and B or C), as it is not needed, or do I have to do that myself? Query optimization like Karnaugh maps is not available in Lucene. For each level of 'and' and 'or' in the (rewritten) query, as well as for all terms in the query, a separate scorer will be used during query search. The query rewrite could in principle do this, but it might affect the score values. Regards, Paul Elschot
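Since Lucene does not do this optimization itself, exactly repeated subclauses can be dropped before the query is built. A minimal sketch, using the clause's string form as the deduplication key (illustrative; with real Lucene Query objects one would rely on their equals()/hashCode() instead, and note Paul's caveat that removing clauses can change score values):

```java
import java.util.*;

// Sketch: drop repeated identical subclauses, such as the second
// "(A AND B OR C)", before handing the query to the search engine.
// Clauses are plain strings here purely for illustration.
public class DedupClauses {
    static List<String> dedup(List<String> clauses) {
        // LinkedHashSet keeps the first occurrence and preserves order
        return new ArrayList<>(new LinkedHashSet<>(clauses));
    }

    public static void main(String[] args) {
        List<String> clauses = Arrays.asList(
            "(A AND B OR C)", "(D)", "(A AND B OR C)");
        System.out.println(dedup(clauses));
    }
}
```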
Re: Self Join Query
Sachin, As the merging of the results is the issue, I'll assume that you don't have clear user requirements for that. The simplest way out of that is to allow the users to search the B's first, and once they have determined which B's they'd like to use, use those B's to limit the results of user searches in A. That would normally be done by filtering on B, much like RangeFilter. Caching that filter allows for quick repeated searches in A. Is that what the users want? For each normalization a filter can be used to search across it. One feature of filters is that the original score is lost. Would you have user requirements related to this? As the texts of A and B are the problem for reindexing, you may want to index these separately: one index for Aid+Atext, and one for Bid+Btext. That leaves the A-B 1-n association: one more index for Aid+Bids. In this last one you could also put a small text field of A. Denormalizing the Btext into Aid+Bids as Aid+Bids+Btexts can make it difficult for the users to explicitly select the B's. OTOH it makes it easy to implicitly select the B's. What do the users want? Each id field will have to be indexed to allow filtering, and stored to allow retrieval for filtering in another index. Retrieving stored fields is normally a performance bottleneck, so a FieldCache might be handy. Regards, Paul Elschot On Thursday 10 January 2008 12:58:44 sachin wrote: Here are more details about my issue. I have two tables in the database. A row in table 1 can have multiple rows associated with it in table 2. It is a one to many mapping. Let's say a row in table 1 is A and it has multiple rows B1, B2 and B3 associated with it in table 2. I need to search on both A and B types and the result should have A and all the Bs associated with it. Also for your information, A and Bs are long text in the database. I could have two approaches for indexing/searching. The first approach is to create the index in denormalized form. 
In this case the document would be like A, B1, B2, B3. The issue with this approach is that any modification to any row would require me to re-index the document again and fetch A and all Bs again from the database. This is a heavy process. The other approach is to index A, B1, B2 and B3 in different documents and after search merge the results. This makes my re-indexing lighter but I need to put extra logic to merge the results. For this type of index I would require a self join kind of query from Lucene. The query can be written by using a boolean query but merging of two types of documents is an issue. If I go by this approach for indexing, what is the best way to fetch the results? I hope I have made myself clear. Thanks Sachin On Tue, 2008-01-08 at 20:13 +0530, Developer Developer wrote: Provide more details please. Can you not use boolean query and filters if need be ? On Jan 8, 2008 7:23 AM, sachin [EMAIL PROTECTED] wrote: I need to write a Lucene query something similar to SQL self joins. My current implementation is very primitive. I fire the first query, get the results, based on the result of the first query I fire a second query and then merge the results from both the queries. The whole processing is very expensive. Doing this is very easy with an SQL query as we need to just write a self join query and the database does the rest for you. What is the best way of implementing the above functionality in Lucene? Regards Sachin
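The application-side merge for the second (normalized) approach can be sketched as follows. This is hypothetical Python, assuming the searches against the A index and the B index have already produced id/document pairs:

```python
def join_results(a_hits, b_hits):
    """Merge hits from an A index and a B index on the shared Aid.
    a_hits: {aid: a_doc}; b_hits: list of (aid, b_doc) pairs."""
    merged = {aid: {"a": doc, "bs": []} for aid, doc in a_hits.items()}
    for aid, b_doc in b_hits:
        if aid in merged:          # keep only B's whose parent A matched
            merged[aid]["bs"].append(b_doc)
    return merged
```

This is the "extra logic to merge the results" the thread talks about; filtering A by cached B filters, as suggested in the reply, pushes the same join into the search itself.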
Re: Query processing with Lucene
On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote: This is done by Lucene's scorers. You should however start in http://lucene.apache.org/java/docs/scoring.html, - scorers are described in the Algorithm section. Offsets are used by Phrase Scorers and by Span Scorer. That is for the case that offsets were meant to be positions within a document. It is also possible that offsets were meant in the sense of using skipTo(doc) instead of next() on a Scorer. This is done during query search when at least one term is required. Regards, Paul Elschot Doron On Jan 8, 2008 11:24 PM, Marjan Celikik [EMAIL PROTECTED] wrote: Doron Cohen wrote: Hi Marjan, Lucene processes the query in what can be called one-doc-at-a-time fashion. For the example query - x y - (not the phrase query x y) - all documents containing either x or y are considered a match. When processing the query - x y - the posting lists of these two index terms are traversed, and for each document met on the way, a score is computed (taking into account both terms), and collected. At the end of the traversal, usually the best N collected docs are returned as the search result. So, this is an exhaustive computation creating a union of the two postings. For the query - +x +y - an intersection rather than a union is required, and the way Lucene does it is again to traverse the two posting lists, just that only documents seen in both lists are scored and collected. This makes it possible to optimize the search, skipping large chunks of the posting lists, especially when one term is rarer than the other. Thank you for your answer. I am having trouble finding the function which traverses the documents such that they get scored. Can you please tell me where the posting lists (for a +x +y query) get intersected after they get read (by next() I guess) from the index? In particular, I am interested in how Lucene gets the new positions (offsets) of the documents seen in both posting lists, i.e. 
positions (in a document) for the query word x, and positions for the query word y. Thank you in advance! Marjan.
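The +x +y intersection described above can be illustrated with plain sorted lists of doc ids. This is a sketch only, not Lucene's actual ConjunctionScorer (which advances whole scorers, using skipTo() to leap over chunks of a posting list rather than stepping an index):

```python
def intersect(postings_x, postings_y):
    """Doc-at-a-time intersection of two sorted posting lists,
    leapfrogging the way a conjunction of scorers does with skipTo()."""
    hits, i, j = [], 0, 0
    while i < len(postings_x) and j < len(postings_y):
        dx, dy = postings_x[i], postings_y[j]
        if dx == dy:
            hits.append(dx)        # document seen in both lists: a match
            i += 1
            j += 1
        elif dx < dy:
            i += 1                 # skipTo(dy) would jump here in big steps
        else:
            j += 1
    return hits
```

The "especially when one term is rarer than the other" remark corresponds to the rare list forcing large skips in the frequent list.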
Re: Can I do boosting based on term postions?
On Tuesday 18 December 2007 14:59:45 Peter Keegan wrote: Should I open a Jira issue? What shall I say? http://www.apache.org/foundation/how-it-works.html Regards, Paul Elschot
Re: Field weights
Karl, This might work for you: https://issues.apache.org/jira/browse/LUCENE-293 Regards, Paul Elschot On Friday 14 December 2007 18:06:01 Karl Wettin wrote: I have an index that contains three sorts of documents: Car brand Tire brand Tire pressure (Please bear with me, the real index has nothing to do with cars. I just try to explain the problem in an alternative domain to avoid NDA conflicts.) There is a hierarchical composite relationship between these sorts of documents. A document describing tire pressure also contains tire brand and car brand. A document describing tire brand also contains information about car brand. A document describing car brand contains only that. The requirement is that the consumer of the API should not have to specify what fields they are searching in. There is no time (nor training data) to implement a hidden Markov model (HMM) tokenizer or something along that path in order to extract possible attributes from the query string. Instead the query string is tokenized once per field and assembled into one huge query. Normally this works fairly well. Here are some example documents:
Volvo
Volvo, Michelin
Volvo, Nokian
Volvo, Nokian, 2.2 bars
Volvo, Firestone, 2.4 bars
Saab
Saab, Michelin
Saab, Nokian
Saab, Nokian, 2.1 bars
Saab, Firestone
Saab, Firestone, 2.4 bars
Saab, Firestone, 2.5 bars
If I search for Saab the top result will be the document representing the car brand Saab. The query would look like this: car:saab tire:saab pressure:saab But let's say Saab starts manufacturing tires too:
Saab
Saab, Saab tires
Saab, Saab tires, 1.9 bars
Saab, Saab tires, 1.8 bars
If I search for Saab I still want the top result to be Saab the car brand. But it no longer is; the match for Saab, Saab tires now has a greater score than Saab, of course. My idea is to work along the line of indexing Saab in the tire brand and tire pressure field too. Now searching for Saab will yield a result where the car brand Saab is the top result. 
However, this will not work as I have different tokenization strategies for each field (stemming and what not). Tokenizing the query string Saab for the field tire brand in Swedish might end up as saa and will thus not find the token Saab inserted for the document describing the car brand Saab. I have a couple of experiments in my head I need to try out, starting with tokenizing query strings per field and using the tokens generated for the field car brand as the query in the tire brand and tire pressure fields too. And vice versa. Any brilliant ideas that might work? Hacky solutions are OK.
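The tokenize-once-per-field assembly described in this thread can be sketched like this. A hypothetical illustration; the analyzer functions here are stand-ins for real per-field Lucene analyzers (the second one mimics the Swedish stemmer that turns "Saab" into "saa"):

```python
def build_query(query_string, analyzers):
    """Tokenize the same query string once per field with that field's
    analyzer and OR the per-field clauses into one big query string."""
    clauses = []
    for field, analyze in analyzers.items():
        for token in analyze(query_string):
            clauses.append(f"{field}:{token}")
    return " ".join(clauses)
```

The mismatch Karl describes is visible here: a token indexed verbatim in one field will not match a query token produced by a different field's analyzer, which is why the query tokens must be generated with each field's own tokenization.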
Re: Scoring for all the documents in the index relative to a query
Gentlefolk, Well, the javadocs as patched at LUCENE-584 try to change all the cases of zero scoring to 'non matching'. I'm happily bracing for a minor conflict with that patch. In case someone wants to take another look at the javadocs as patched there, don't let me stop you... Regards, Paul Elschot On Monday 19 November 2007 23:35:07 Yonik Seeley wrote: On Nov 19, 2007 5:03 PM, Chris Hostetter [EMAIL PROTECTED] wrote: (I'm not actually sure how the Hits class treats negative values All Lucene search methods except the ones that take a HitCollector filter out final scores <= 0. Solr does allow scores <= 0 through, since it had different collection methods to avoid score normalization (back when Lucene still did it for TopDocs). -Yonik
Re: Search performance using BooleanQueries in BooleanQueries
On Tuesday 06 November 2007 23:14:01 Mike Klaas wrote: On 29-Oct-07, at 9:43 AM, Paul Elschot wrote: On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote: +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e is much faster than (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e) where the second one is a result from BooleanQuery in BooleanQuery, and all have Occur.MUST. Simplifying boolean queries like this is not available in Lucene, but it would have a positive effect on search performance, especially when prop1:a and prop2:b have a high document frequency. Wait--shouldn't the outer-most BooleanQuery provide most of this speedup already (since it should be skipTo'ing between the nested BooleanQueries and the outermost). Is it the indirection and sub-query management that is causing the performance difference, or differences in skipTo behaviour? The usual Lucene answer to performance questions: it depends. After every hit, next() needs to be called on a subquery before skipTo() can be used to find the next hit. It is currently not defined which subquery will be used for this first next(). The structure of the scorers normally follows the structure of the BooleanQueries, so the indirection over the deep subquery scorers could well be relevant to performance, too. Which of these factors actually dominates performance is hard to predict in advance. The point of skipTo() is that it tries to avoid disk I/O as much as possible for the first time that the query is executed. Later executions are much more likely to hit the OS cache, and then the indirections will be more relevant to performance. I'd like to have a good way to do a performance test on a first query execution, in the sense that it does not hit the OS cache for its skipTo() executions, but I have not found a good way yet. Regards, Paul Elschot
Re: 2/3 of terms matched + coverage filter
On Wednesday 31 October 2007 14:51:12 Tobias Hill wrote: My documents all have a field with a variable number of terms (but rather few): Doc1.field = foo bar gro Doc2.field = foo bar gro mot slu Now I would like to search using the terms foo bar gro Problem 1: I'd like to express that at least any two of the three terms must match. Do I have to construct this clause myself - i.e. (foo bar) | (foo gro) | (bar gro), or is there some clever way to do this? BooleanQuery.setMinimumNumberShouldMatch(int) does this, have a look at the javadocs for the details. Problem 2: I'd like to express that if the doc.field has too many terms that weren't matched it should not be included at all in the result. In the example above Doc2 might have too many terms that were not matched to be included in the result. Is this kind of query possible, and how? The general case: I want to find those docs that have X% of the search terms matched and where the actual match covers at least Y% of the available terms on the document. This Y% is not directly possible, but I would expect the default document score to correlate reasonably well with coverage. In case you want an exact Y% cutoff, you'll run into the fact that the field norm (the inverse square root of the field length) is encoded in only 8 bits, which is rather coarse. Regards, Paul Elschot
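The two requirements can be sketched together as a post-filter on term sets. Illustrative Python only: setMinimumNumberShouldMatch handles the first requirement inside Lucene, while the Y% coverage test below is the part that, as noted above, Lucene does not offer directly:

```python
def matches(doc_terms, query_terms, min_match=2, min_coverage=0.5):
    """Require at least min_match query terms in the field, and require
    the matched terms to cover at least min_coverage of the field."""
    matched = sum(1 for t in query_terms if t in doc_terms)
    coverage = matched / len(doc_terms) if doc_terms else 0.0
    return matched >= min_match and coverage >= min_coverage
```

With the thread's example, Doc1 (foo bar gro) always passes, while Doc2 (foo bar gro mot slu) passes or fails depending on the coverage cutoff, since only 3 of its 5 terms are matched.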
Re: Looking for Exact match but no other terms... how to express it?
On Tuesday 30 October 2007 16:58:09 Tobias Hill wrote: I want to match on the exact phrase foo bar dot on a specific field on my set of documents. I only want results where that field has exactly foo bar dot and no more terms. I.e. a document with foo bar dot alu should not match. A phrase query with slop 0 seems reasonable, but how do I express but nothing more than these terms? Another way to do this is by indexing a special begin and end token before and after the tokens of the field, and by extending your queries with these special tokens, for example: =begin= foo bar dot =end= . Regards, Paul Elschot
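The begin/end token trick can be sketched with plain token lists. Illustrative Python; in Lucene the sentinels would be injected by the analyzer at indexing time, and the comparison below corresponds to a slop-0 phrase query that includes the sentinels:

```python
BEGIN, END = "=begin=", "=end="

def index_field(tokens):
    """Surround the field's tokens with sentinel tokens at indexing time."""
    return [BEGIN, *tokens, END]

def exact_field_match(indexed, phrase):
    """An exact whole-field match is a slop-0 phrase including sentinels:
    any extra term pushes =end= out of position and the phrase fails."""
    return indexed == [BEGIN, *phrase, END]
```

This is why "foo bar dot alu" no longer matches: the sentinel after "dot" is displaced by "alu".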
Re: Search performance using BooleanQueries in BooleanQueries
On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote: Hello, I am seeing that a query with boolean queries in boolean queries takes much longer than just a single boolean query when the number of hits is fairly large. For example +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e is much faster than (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e) where the second one is a result from BooleanQuery in BooleanQuery, and all have Occur.MUST. Is there a way to detect and rewrite the second inefficient query? query.rewrite() does not change the query AFAICS. Simplifying boolean queries like this is not available in Lucene, but it would have a positive effect on search performance, especially when prop1:a and prop2:b have a high document frequency. You could write this yourself, for example by overriding BooleanQuery.rewrite(). Take care about query weights, though. Regards, Paul Elschot thanks for any help, Regards Ard
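The rewrite suggested above (overriding BooleanQuery.rewrite() to collapse nested required clauses) amounts to collecting the leaves of an AND chain. A hypothetical sketch with queries as nested tuples, ignoring the query-weight caveat Paul mentions:

```python
def flatten_must(query):
    """Collect the required leaf clauses of a nested AND chain, so that
    (AND, (AND, (AND, p1, p2), p3), p4) becomes (AND, p1, p2, p3, p4)."""
    def leaves(q):
        if isinstance(q, tuple) and q[0] == "AND":
            out = []
            for c in q[1:]:
                out.extend(leaves(c))   # recurse into nested conjunctions
            return out
        return [q]                      # anything else is kept as a clause
    return ("AND", *leaves(query))
```

The flat form corresponds to the faster +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e query: one conjunction scorer over five terms instead of a chain of nested scorers.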
Re: Cache BitSet or doc number?
Have a look at decoupling Filter from BitSet: http://issues.apache.org/jira/browse/LUCENE-584 There also is a SortedVIntList there that stores document numbers more compactly than BitSet, and an implementation of CachingFilterQuery (iirc) that chooses the more compact representation of BitSet and SortedVIntList. Regards, Paul Elschot On Saturday 27 October 2007 02:15:48 Yonik Seeley wrote: On 10/26/07, John Patterson [EMAIL PROTECTED] wrote: Thom Nelson wrote: Check out the HashDocSet from Solr, this is the best way to cache small sets of search results. In general, the Solr BitSet/DocSet classes are more efficient than using the standard java.util.BitSet. You can use these independent of the rest of Solr (though I recommend checking out Solr if you want to do complex caching). I imagine the fastest way to combine cached results is to store them in an array ordered by doc number so that the ConjunctionQuery can use them directly. The Javadoc for HashDocSet says that they are stored out of order which would make this impossible. You're speaking at quite an abstract level... it really depends on what specific issue you are seeing that you're trying to solve. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Adding support for NOT NEAR construct?
Dave, One can use SpanNotQuery to get NOT NEAR by using this generalized structure: SpanNot(foo, SpanNear(foo, bar, distance)) This also allows for example: SpanNot(two, SpanNear(one, three, distance)) Btw. I don't know of any query language that has this second form. AND NOT normally does not work for this because it works on doc level and not within the matching text of a field. Regards, Paul Elschot On Wednesday 17 October 2007 17:57:21 Dave Golombek wrote: We've run into a situation where having NOT NEAR queries would really help. I haven't been able to find any discussion of adding this to Lucene in the past, so wanted to ask if people had any comments about it before I started trying to make the change. I've looked at NearSpansUnordered and it seems that reversing the logic in atMatch() would go a long way towards implementation; NearSpansOrdered is a bit harder, depending upon the exact semantics of NOT NEAR that we want to implement. For queries, I was thinking that either foo bar~-10 or foo bar!~10 might be reasonable; the former should be pretty easy to parse. Does this sound reasonable? Something for contrib? Thanks, Dave Golombek Senior Software Engineer Black Duck Software, Inc. [EMAIL PROTECTED] T +1.781.810.2079 F +1.781.891.5145 C +1.617.230.5634 http://www.blackducksoftware.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
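The SpanNot(foo, SpanNear(foo, bar, distance)) structure can be illustrated on term positions directly. A simplified sketch (single-position spans, symmetric distance, unordered), not Lucene's span machinery:

```python
def span_not_near(foo_positions, bar_positions, distance):
    """Keep occurrences of foo that have no bar within `distance`
    positions: the NOT NEAR described as SpanNot(foo, SpanNear(foo, bar, d))."""
    return [p for p in foo_positions
            if all(abs(p - b) > distance for b in bar_positions)]
```

Unlike a doc-level AND NOT, a document can still match as long as at least one occurrence of foo is far enough from every bar.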
Re: Scoring a single document from a corpus based on a given query
On Wednesday 10 October 2007 18:44, lucene_user wrote: I would like to score a single document from a corpus based on a given query. The formula score(q,d) is basically what I am looking for. Pseudo Code of Something Close to what I am looking for: indexReader.score(query, documentId); The formula score(q,d) is used throughout the documentation to describe similarity but there does not seem to be a corresponding java method. I could work around the issue by applying a search filter to only consider the particular document I am looking for. I was hoping for a cleaner approach. You can try this: Explanation e = indexSearcher.explain(query, documentId); and get the score value from the explanation. Have a look at the code of any Scorer.explain() method on how to get the score value only. There really is no need to filter in this case. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Scorer skipTo() expectations?
Dan, In Scorers, when skipTo() or next() returns true for the second or later time, the result of doc() will be increased. When Scorer.skipTo() does not have document order, documents will be lost, which means that not all matching documents will be found by the search. For disjunctions (OR), one needs to merge the documents of two Scorers using next() to iterate over the documents. The merging is normally done on the fly using a specialized priority queue on the doc() values in DisjunctionSumScorer. No sorting of complete document lists is done at search time, that is done at indexing time. And since TermScorer uses the index directly, it will always return documents in order. The only exception to document ordering is BooleanScorer.next(), which is used by BooleanQuery for some cases of top level disjunctions, and then only when documents are allowed to be scored out of order. The reason for that is performance, BooleanScorer uses a faster data structure than a priority queue, but BooleanScorer does not implement skipTo(). Regards, Paul Elschot On Thursday 04 October 2007 09:12, Dan Rich wrote: Hi, I have a custom Query class that provides a long list of lucene docIds (not for filtering purposes), which is one clause in a standard BooleanQuery (which also contains TermQuery instances). I have a custom Scorer that goes along with the custom Query class. What (if any) document ordering requirements does the Scorer class have for its skipTo(int docId) method? In particular, currently I'm sorting/returning the docIds in ascending order from my custom Query class. That can be expensive for large docId lists; is sorting necessary? It looks like skipTo() might expect the documents it gets to be in ascending order to behave correctly as part of a BooleanQuery, but I can't tell for sure from the doc. If the document list from my custom Scorer class does not have its document list in ascending order (e.g. 
10, 80, 40, 60, 50) will whatever uses skipTo() potentially lose hits? If not, is there any performance concern with having the docIds unordered?
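The priority-queue merge that a disjunction performs on doc() values can be sketched with heapq. Illustrative only; the real DisjunctionSumScorer also sums the per-clause scores while merging, and this sketch shows why each input list must be in ascending order:

```python
import heapq

def disjunction_docs(posting_lists):
    """Merge several sorted posting lists into one ascending stream of
    unique doc ids, the way a disjunction's priority queue does."""
    merged, last = [], None
    for doc in heapq.merge(*posting_lists):
        if doc != last:            # a doc matching several clauses appears once
            merged.append(doc)
            last = doc
    return merged
```

If one input list were out of order (e.g. 10, 80, 40, ...), heapq.merge's output would no longer be sorted and the duplicate suppression would silently drop or repeat hits, which is the losing-documents behaviour described above.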
Re: a query for a special AND?
As for suggestions on how to do this, I have no other than to make sure that you can create the queries necessary to obtain the required output. Regards, Paul Elschot On Sunday 30 September 2007 09:20, Mohammad Norouzi wrote: Hi Paul, thanks, I dot your idea, now I am planing to implement this, de-normalization, now I just need your suggestion on this issue and tell me which one is the best I am considering to put a Field as follows: my_de_normalized_field service_name : service_value if there are more than one service and value then I need to separate with a character such as comma service_name1:value1 , service_name2: value2, apart from that, each value might be a range value say, 10 - 20 or a single value say, -2 do you have any suggestion on this? thank you very much On 9/20/07, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 20 September 2007 09:19, Mohammad Norouzi wrote: well, you mean we should separate documents just like relational tables in databases ? Quite the contrary, it's called _de_normalization. This means that the documents in lucene normally contain more information than is present in a single relational entity. if yes, how to make the relationship between those documents Lucene has no facilities to maintain relational relationships among its documents. A lucene index allows free format documents, i.e. any document may have any field or not. In practice you will need at least a primary key, but even that you will need to program yourself. Regards, Paul Elschot thank you so much Paul On 9/20/07, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote: Sorry Paul I just hurried in replying ;) I read the documents of Lucene about query syntax and I figured out the what is the difference but my problem is different, this is preoccupied my mind and I am under pressure to solve this problem, after analyzing the results I get, now I think we need a group by in our query. 
let me tell you an example: we need a list of patients that have been examined by certain services specified by the user , say service one and service two. in this case here is the correct result: patient-id service_name patient_result 1 s112 1 s213 2 s1 41 2 s222 but for example, following is incorrect because patient 1 has no service with name service2: patient-id service_name patient_result 1 s112 1 s313 That depends on what you put in your lucene documents. You can only get complete lucene documents as query results. For the above example a patient with all service names should be indexed in a single lucene doc. The rows above suggest that the relation between patient and service forms the relational result. However, for a text search engine it is usual to denormalize the relational records into indexed documents, depending on the required output. Regards, Paul Elschot On 9/20/07, Mohammad Norouzi [EMAIL PROTECTED] wrote: Hi Paul, would you tell me what is the difference between AND and + ? I tried both but get different result with AND I get 1777 documents and with + I get nearly 25000 ? On 9/17/07, Paul Elschot [EMAIL PROTECTED] wrote: On Monday 17 September 2007 11:40, Mohammad Norouzi wrote: Hi I have a problem in getting correct result from Lucene, consider we have an index containing documents with fields field1 and field2 etc. now I want to have documents in which their field1 are equal one by one and their field2 with two different value to clarify consider I have this query: field1:val* (field2:myValue1 XOR field2:myValue2) Did you try this: +field1:val* +field2:myValue1 +field2:myValue2 Regards, Paul Elschot now I want this result: field1 field2 val1myValue1 val1myValue2 val2myValue1 val2myValue2 this result is not acceptable: val3 myValue1 or val4 myValue1 val4 myValue3 I put XOR as operator
Re: Translating Lucene Query Syntax to Traditional Boolean Syntax
On Tuesday 25 September 2007 03:05, Martin Bayly wrote: We have an application that performs searches against a Lucene based index and also against a Windows Desktop Search based index. For simple queries we'd like to offer our users a consistent interface that allows them to build basic Lucene style queries using the 'MUST HAVE' (+), 'MUST NOT HAVE' (-) and 'SHOULD HAVE' style of operators as this is probably more intuitive for non 'Boolean Logic' literate users. We would not allow them to use any grouping (parenthesis). Clearly we can pass this directly to Lucene, but for the Windows Desktop Search we need to translate the Lucene style query into a more traditional Boolean query. So this is the opposite of the much discussed Boolean Query to Lucene Query conversion. I'm wondering if anyone has ever done this or whether there is a concept mismatch in there somewhere that will make it difficult to do? My thought was that you could take the standard Lucene operators and simply group them together as follows: e.g. (assuming the Lucene default OR operator) Lucene: +a +b -c -d e f would translate to: (a AND b NOT c NOT d) OR (a AND b NOT c NOT d AND (e OR f)) If I put this back into Lucene (actually Lucene.NET but hopefully its the same) I get back: (+a +b -c -d)(+a +b -c -d +(e f)) which I think is equivalent but not as concise! But I have not tested this against a big index to see if it's equivalent and I have a suspicion that Lucene might score the two versions of the Lucene representation differently. But that's probably not an issue provided the Boolean representation is semantically equivalent to the first Lucene representation. Anyone ever tried this before or have any comments on whether my 'logic' is flawed! Under the hood, the Scorer for a BooleanQuery, BooleanScorer2, does the conversion from + and - to boolean operators in slightly different, but more concise way. It basically maps the boolean query syntax to four operators: AND, OR, ANDNOT, ANDoptional. 
The mapping is only basic because, as a Scorer, it needs to map to other Scorers, and these are: ConjunctionScorer, DisjunctionSumScorer, ReqExclScorer and ReqOptSumScorer, respectively. For ANDoptional the first subquery is required, and the second one is optional. This is a bit like ANDNOT, in which the first query is required, and the second one is prohibited. The addition of ANDoptional/ReqOptSumScorer allows the conciseness, while keeping equivalent semantics. So I think your logic is not flawed, it's just that the traditional set of boolean operators is somehow incomplete for optional subqueries. Mapping + and - to these four operators is not really straightforward, among others because of the coordination factor, and because of the many different possible situations. You're invited to have a look at the source code of BooleanScorer2. There are also test cases for the equivalence of the semantics, see the TestBoolean* classes. Regards, Paul Elschot P.S. When documents may be scored out of order, for some disjunctions (OR), BooleanScorer is used for performance.
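The four-operator mapping can be sketched as a string rewrite. Hypothetical Python mirroring how BooleanScorer2 nests ConjunctionScorer, ReqExclScorer and ReqOptSumScorer (coordination factors ignored):

```python
def to_boolean(required, prohibited, optional):
    """Translate Lucene-style +/- clause lists into the operators the
    thread names: AND, ANDNOT, and ANDoptional (required plus optional)."""
    expr = " AND ".join(required)            # + clauses: plain conjunction
    for p in prohibited:
        expr = f"({expr}) ANDNOT {p}"        # - clauses: required-excluded
    if optional:                             # bare clauses: affect score only
        expr = f"({expr}) ANDoptional ({' OR '.join(optional)})"
    return expr
```

Applied to the thread's example +a +b -c -d e f, this yields a single nested expression rather than the duplicated (a AND b NOT c NOT d) form, which is exactly the conciseness ANDoptional buys.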
Re: a query for a special AND?
On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote: Sorry Paul I just hurried in replying ;) I read the documents of Lucene about query syntax and I figured out what the difference is, but my problem is different; this has preoccupied my mind and I am under pressure to solve this problem. After analyzing the results I get, now I think we need a group by in our query. let me tell you an example: we need a list of patients that have been examined by certain services specified by the user, say service one and service two. in this case here is the correct result:
patient-id  service_name  patient_result
1           s1            12
1           s2            13
2           s1            41
2           s2            22
but for example, the following is incorrect because patient 1 has no service with name service2:
patient-id  service_name  patient_result
1           s1            12
1           s3            13
That depends on what you put in your lucene documents. You can only get complete lucene documents as query results. For the above example a patient with all service names should be indexed in a single lucene doc. The rows above suggest that the relation between patient and service forms the relational result. However, for a text search engine it is usual to denormalize the relational records into indexed documents, depending on the required output. 
now I want to have documents in which their field1 are equal one by one and their field2 with two different value to clarify consider I have this query: field1:val* (field2:myValue1 XOR field2:myValue2) Did you try this: +field1:val* +field2:myValue1 +field2:myValue2 Regards, Paul Elschot now I want this result: field1 field2 val1myValue1 val1myValue2 val2myValue1 val2myValue2 this result is not acceptable: val3 myValue1 or val4 myValue1 val4 myValue3 I put XOR as operator because this is not a typical OR, it's different, it means documents that contains both myValue1 and myValue2 for the field field2 how to build a query to get such result? thanks in advance -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ Sun Certified Java Programmer ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ Sun Certified Java Programmer ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ Sun Certified Java Programmer ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: a query for a special AND?
On Thursday 20 September 2007 09:19, Mohammad Norouzi wrote: well, you mean we should separate documents just like relational tables in databases ? Quite the contrary, it's called _de_normalization. This means that the documents in lucene normally contain more information than is present in a single relational entity. if yes, how to make the relationship between those documents Lucene has no facilities to maintain relational relationships among its documents. A lucene index allows free format documents, i.e. any document may have any field or not. In practice you will need at least a primary key, but even that you will need to program yourself. Regards, Paul Elschot thank you so much Paul On 9/20/07, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote: Sorry Paul I just hurried in replying ;) I read the documents of Lucene about query syntax and I figured out the what is the difference but my problem is different, this is preoccupied my mind and I am under pressure to solve this problem, after analyzing the results I get, now I think we need a group by in our query. let me tell you an example: we need a list of patients that have been examined by certain services specified by the user , say service one and service two. in this case here is the correct result: patient-id service_name patient_result 1 s112 1 s213 2 s1 41 2 s222 but for example, following is incorrect because patient 1 has no service with name service2: patient-id service_name patient_result 1 s112 1 s313 That depends on what you put in your lucene documents. You can only get complete lucene documents as query results. For the above example a patient with all service names should be indexed in a single lucene doc. The rows above suggest that the relation between patient and service forms the relational result. However, for a text search engine it is usual to denormalize the relational records into indexed documents, depending on the required output. 
Regards,
Paul Elschot

> On 9/20/07, Mohammad Norouzi [EMAIL PROTECTED] wrote:
>> Hi Paul, would you tell me what the difference is between AND and + ? I tried both but got different results: with AND I get 1777 documents, and with + I get nearly 25000.
>>
>> On 9/17/07, Paul Elschot [EMAIL PROTECTED] wrote:
>>> On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:
>>>> Hi, I have a problem getting the correct result from Lucene. Consider an index containing documents with fields field1, field2, etc. Now I want the documents in which the field1 values are equal one by one and field2 has two different values. To clarify, consider this query:
>>>>
>>>>   field1:val* (field2:myValue1 XOR field2:myValue2)
>>>
>>> Did you try this:
>>>
>>>   +field1:val* +field2:myValue1 +field2:myValue2
>>>
>>> Regards,
>>> Paul Elschot
>>>
>>>> Now I want this result:
>>>>
>>>>   field1  field2
>>>>   val1    myValue1
>>>>   val1    myValue2
>>>>   val2    myValue1
>>>>   val2    myValue2
>>>>
>>>> This result is not acceptable:
>>>>
>>>>   val3    myValue1
>>>>
>>>> or:
>>>>
>>>>   val4    myValue1
>>>>   val4    myValue3
>>>>
>>>> I put XOR as the operator because this is not a typical OR; it's different: it means documents that contain both myValue1 and myValue2 for the field field2. How do I build a query to get such a result? Thanks in advance.
>>>>
>>>> --
>>>> Regards,
>>>> Mohammad
>>>> see my blog: http://brainable.blogspot.com/
>>>> another in Persian: http://fekre-motefavet.blogspot.com/
>>>> Sun Certified Java Programmer
>>>> ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html
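The semantics of Paul's suggested query — each `+` marks a required clause, so a document matches only when every required clause matches — can be sketched without Lucene. A conceptual Python illustration (the documents and matching function are illustrative, not Lucene API):

```python
# Conjunction ("+") semantics sketch: a document matches only when
# every required clause matches. field1 uses a prefix test (val*);
# field2 is multi-valued and must contain BOTH required terms,
# mirroring: +field1:val* +field2:myValue1 +field2:myValue2
docs = [
    {"field1": "val1", "field2": ["myValue1", "myValue2"]},  # acceptable
    {"field1": "val3", "field2": ["myValue1"]},              # missing myValue2
    {"field1": "val4", "field2": ["myValue1", "myValue3"]},  # missing myValue2
]

def matches(doc):
    return (doc["field1"].startswith("val")      # +field1:val*
            and "myValue1" in doc["field2"]      # +field2:myValue1
            and "myValue2" in doc["field2"])     # +field2:myValue2

hits = [d["field1"] for d in docs if matches(d)]
```

This also shows why the denormalized single-document-per-entity layout matters: the "both values in one field" requirement is only expressible when both values live in the same document.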
Re: a query for a special AND?
On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:

> Hi, I have a problem getting the correct result from Lucene. Consider an index containing documents with fields field1, field2, etc. Now I want the documents in which the field1 values are equal one by one and field2 has two different values. To clarify, consider this query:
>
>   field1:val* (field2:myValue1 XOR field2:myValue2)

Did you try this:

  +field1:val* +field2:myValue1 +field2:myValue2

Regards,
Paul Elschot

> Now I want this result:
>
>   field1  field2
>   val1    myValue1
>   val1    myValue2
>   val2    myValue1
>   val2    myValue2
>
> This result is not acceptable:
>
>   val3    myValue1
>
> or:
>
>   val4    myValue1
>   val4    myValue3
>
> I put XOR as the operator because this is not a typical OR; it's different: it means documents that contain both myValue1 and myValue2 for the field field2. How do I build a query to get such a result? Thanks in advance.
>
> --
> Regards,
> Mohammad
Re: Span queries and complex scoring
Cedric,

In case your requirements allow this, try using a subclass of Spans that has a score() method returning a value that is used, together with the other span info, to provide a score value to your own SpanScorer at the top level. This score value can summarize the influence of the individual span scores of the subqueries. For this you will need to change the whole span package, but it is somewhat simpler than using a complete Scorer for each SpanQuery in the query tree.

With a lot of nested SpanOrQueries, merging the Spans can become a performance bottleneck. The current situation can be improved by creating a specialized PriorityQueue for Spans, much like the ScorerDocQueue that is used by DisjunctionSumScorer. With this, it is possible to avoid SpanOrQuery by using term payloads to compute the score value for the Spans of a SpanTermQuery, but iirc the payloads are not yet in the trunk.

Regards,
Paul Elschot

On Tuesday 11 September 2007 16:17, melix wrote:

> Hi,
>
> I'm working on an application which requires complex scoring (based on semantic analysis). The scoring must be highly configurable, and I've found ways to do that, but I'm facing a discrete but annoying problem. All my queries are, basically, complex span queries: for example, a SpanNearQuery which embeds a SpanOrQuery which itself may embed another SpanNearQuery, etc. I've followed the instructions at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/package-summary.html#changingScoring about changing scoring.
>
> The problem is that a document's score is highly dependent on *what* matched, and the getSpans() method on span queries does not provide that kind of information. I created my own SpanQuery subclasses which override the createWeight method so that the scorer used is my own too. It basically replaces the SpanScorer, and should recurse the spans tree to compose a score based on the type of subqueries (near, and, or, not) and what matched.
> The problem is that the getSpans() methods that exist in Lucene are either anonymous classes which I cannot browse, or I do not have access to the required information. Basically, in a SpanOrQuery, I am not able to find out what matched. Have any of you faced that kind of problem and found an elegant way to do it without having to completely rewrite each getSpans() method for all types of queries (this is basically what was done in a previous version of the application)?
>
> Thanks,
> Cedric
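The merge that Paul identifies as the SpanOrQuery bottleneck is essentially a k-way merge of sorted (doc, start, end) streams, which is exactly what a priority queue does well. A conceptual sketch in Python using `heapq` (the span tuples and function names are illustrative, not the Lucene span package):

```python
import heapq

# SpanOrQuery-style merge sketch: each sub-query enumerates its spans
# as (doc, start, end) tuples in sorted order; a heap-based merge
# yields the union in global (doc, start, end) order, which is what a
# specialized PriorityQueue for Spans would do inside Lucene.
def merge_spans(*span_streams):
    # heapq.merge lazily merges already-sorted iterables via a heap.
    return list(heapq.merge(*span_streams))

a = [(1, 0, 2), (3, 5, 7)]   # spans from one sub-query
b = [(1, 1, 3), (2, 0, 1)]   # spans from another sub-query
merged = merge_spans(a, b)
```

Keeping only one comparison per emitted span (the heap's sift) instead of re-scanning all sub-spans is the efficiency Paul attributes to ScorerDocQueue in DisjunctionSumScorer.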