Can Lucene tell which field matched?
Hi,

I am new to Lucene and am working on a search module for some XML data. I need to provide a "search all" feature that looks in all XML fields. Apparently Lucene (2.4.0) does not provide such a "search all" facility, so I have to build a query that associates my search term with every available XML element.

Assume that I am searching an address book (a fictitious example for illustration) made up of contacts (my Lucene documents), each containing several fields such as name, address, city, ... My search for "paul" in the address book will then look like:

    name:paul OR address:paul OR city:paul

and so on.

Lucene will then tell me which contacts match my query, but is there a way to know which field(s) matched the request? The goal is to display the XML with the matching fields highlighted.

I did not find anything like this in Lucene, so it seems that the only way is to perform an additional search field by field. So if I have 100 fields per document (I told you my address book was a fictitious example; the XML I am working on is a little more complex) and get 100 results that I want to display in a list, this means that I would need to perform a very large number of additional search requests.

Please tell me that there is a better way to do the job...

--
View this message in context: http://www.nabble.com/Can-Lucene-tells-which-field-matched---tp20357552p20357552.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
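A minimal sketch of the query expansion described in this message, in plain Java with no Lucene dependency. The class name `SearchAllQuery` and the method `expand` are hypothetical, and the field names are the ones from the address-book example. Inside Lucene itself, `MultiFieldQueryParser` (package `org.apache.lucene.queryParser`) is the usual way to parse one query string against several fields; the sketch only shows the shape of the resulting query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class SearchAllQuery {
    /**
     * Expands one search term into "f1:term OR f2:term OR ..." over the given fields.
     * The term is inserted as-is; escaping of query-syntax characters is not handled here.
     */
    static String expand(List<String> fields, String term) {
        StringJoiner joiner = new StringJoiner(" OR ");
        for (String field : fields) {
            joiner.add(field + ":" + term);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Field list taken from the address-book example in the message.
        System.out.println(expand(Arrays.asList("name", "address", "city"), "paul"));
    }
}
```

The resulting string can be handed to a query parser. It does not by itself answer the question of which field matched, which is what the replies in this thread address.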
possible score value
Hello,

I have been going through the scoring documentation and code. I expected Lucene to enforce score values in the range [0, 1], but from what I can grasp from the code and docs, score values can be greater than one.

Does Lucene consider score values greater than 1 valid?

Kind regards,
--
Francisco
Re: possible score value
Hi Francisco,

Did you come across

    scoreNorm = 1.0f / topDocs.getMaxScore();

or something of this sort in Hits? As far as I know, the raw score can be greater than 1, but the scores are finally divided by the maxScore of the matched document set, i.e. an upper limit of 1 is set (for the top-scoring document).

Hope this clarifies things! :)

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw

On Thu, Nov 6, 2008 at 4:20 PM, Francisco Borges [EMAIL PROTECTED] wrote:
> Hello, I have been going through the scoring documentation and code. [...]
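The normalization described in this reply can be sketched in plain Java. This is only an illustration, not the Hits source: the class `ScoreNormalizer` is hypothetical, and the condition that scaling is applied only when the best raw score exceeds 1.0 is an assumption about how Hits behaves. Raw scores obtained without such a step can exceed 1.

```java
final class ScoreNormalizer {
    /** Returns a scaled copy of rawScores in which the top score is at most 1.0. */
    static float[] normalize(float[] rawScores) {
        float maxScore = 0.0f;
        for (float score : rawScores) {
            maxScore = Math.max(maxScore, score);
        }
        // Assumption: scale down only when the best raw score is above 1.0;
        // otherwise leave the scores untouched.
        float scoreNorm = (maxScore > 1.0f) ? 1.0f / maxScore : 1.0f;
        float[] normalized = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] * scoreNorm;
        }
        return normalized;
    }
}
```

Because the divisor depends on the top hit of each result set, scores normalized this way are not comparable across different queries.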
Re: BoostingTermQuery scoring
Not sure, but it sounds like you are interested in a higher-level Query, kind of like the BooleanQuery, but then part of it sounds like it is per document, right? Is it that you want to deal with multiple payloads in a document, or multiple BTQs in a bigger query?

On Nov 4, 2008, at 9:42 AM, Peter Keegan wrote:
> I'm using BoostingTermQuery to boost the score of documents with terms containing payloads (boost value > 1). I'd like to change the scoring behavior such that if a query contains multiple BoostingTermQuery terms (either required or optional), documents containing more matching terms with payloads always score higher than documents with fewer terms with payloads. Currently, if one of the terms has a high IDF weight and contains a boosting payload but there are no payloads on the other matching terms, the document may score higher than docs with other matching terms that have payloads and lower IDF.
>
> I think what I need is a way to increase the weight of a matching term in BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do this. Any suggestions?
>
> Thanks,
> Peter

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
RE: Can Lucene tell which field matched?
Hi,

I have implemented such a solution using the query explanation. IndexSearcher has an explain(Query query, int document) method that returns an Explanation object; on the Explanation object you can ask whether it is a match with #isMatch(). You still need to repeat this for each found document, though.

Daan

-----Original Message-----
From: Dora [mailto:[EMAIL PROTECTED]
Sent: Thursday, 6 November 2008 10:19
To: java-user@lucene.apache.org
Subject: Can Lucene tell which field matched?

> Hi I am new to Lucene and working on a search module for some XML data: I need to provide a search all able to look in all xml fields. [...]
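The per-document, per-field check that Daan describes could be organized as in the sketch below. It is deliberately free of Lucene types so that it runs on its own: the `FieldProbe` interface is a hypothetical stand-in for a call such as `searcher.explain(queryForField, docId).isMatch()`, where `queryForField` would be the single-field part of the query. That wiring is an assumption and is not taken from Daan's implementation, which was not posted.

```java
import java.util.ArrayList;
import java.util.List;

final class MatchedFields {
    /** Hypothetical hook; real code would delegate to explain(...).isMatch() for one field. */
    interface FieldProbe {
        boolean matches(String field, int docId);
    }

    /** Returns, in input order, the fields for which the probe reports a match on docId. */
    static List<String> forDocument(List<String> fields, int docId, FieldProbe probe) {
        List<String> matched = new ArrayList<String>();
        for (String field : fields) {
            if (probe.matches(field, docId)) {
                matched.add(field);
            }
        }
        return matched;
    }
}
```

The cost is one probe per field per displayed hit, so it is worth running only for the hits that are actually shown on the current page.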
RE: Can Lucene tell which field matched?
Hi Daan,

Could we have an example of your implementation?

Thanks,
Ulrich VACHON

-----Original Message-----
From: Daan de Wit [mailto:[EMAIL PROTECTED]
Sent: Thursday, 6 November 2008 11:35
To: java-user@lucene.apache.org
Subject: RE: Can Lucene tell which field matched?

> Hi, I have implemented such a solution using the query explanation. [...]
What does Sort.RELEVANCE do?
I can specify Sort.RELEVANCE to Searcher.search, as in:

    hits = searcher.search(q, Sort.RELEVANCE); // Using deprecated method to make it short

What is the real effect of specifying the Sort argument like this? Does Sort.RELEVANCE sort the hits in order of the score described in Sect. 3.3, "Understanding Lucene scoring", of Lucene in Action?

If I use the search method without a sort argument, is that equivalent to specifying Sort.INDEXORDER?

T. Kuro Kurosaka, Basis Technology
San Francisco, California, U.S.A.
Global Field question (thread-safe)?
I have a use case where I want all of my documents to have, in addition to their other fields, a single field=value. An example use is where I have multiple Lucene indexes that I search in parallel but still need to distinguish:

Index 1: all documents have source=a1
Index 2: all documents have source=a2

This is a common use case that has previously been discussed on this list. My particular question is: when I am indexing, can I create a single Field and use it for all Documents? Note that I am in a multithreaded environment, so many Documents are created, have this same Field added to them, and are subsequently indexed. Are there any threading issues with this particular usage?

Thanks,
Glen
Re: BoostingTermQuery scoring
Let me give some background on the problem behind my question. Our index contains many fields (title, body, date, city, etc.). Most queries search all fields, but for best performance we create an additional 'contents' field that contains all terms from all fields, so that only one field needs to be searched. Some fields, like title and city, are boosted by a factor of 5. In order to make term boosting work, we create an additional field 'boost' that contains all the terms from the boosted fields (title, city).

Then, at search time, a query for "petroleum engineer" gets rewritten to:

    (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer)

Note that the two clauses are OR'd so that a term that exists in both fields will get a higher weight in the 'boost' field. This works quite well at boosting documents with terms that exist in the boosted fields. However, it doesn't work properly if excluded terms are added, for example:

    (+contents:petroleum +contents:engineer -contents:drilling) (+boost:petroleum +boost:engineer -boost:drilling)

If a document contains the term 'drilling' in the 'body' field, but not in the 'title' or 'city' field, a false hit occurs.

Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are added to the 'contents' field, they are assigned a payload (value=5) if the term also exists in one of the boosted fields. The 'scorePayload' method in our Similarity class returns the payload value as a score. The query no longer contains the 'boost' field and is simply:

    +contents:petroleum +contents:engineer -contents:drilling

The goal is to make the payload technique behave like the 'boost' field technique. The problem is that the relevance scores of the top hits are sometimes quite different. The reason is that the IDF value for a given term in the 'boost' field is often much higher than for the same term in the 'contents' field. This makes sense because the 'boost' field contains a fairly small subset of the 'contents' field. Even with a payload of '5', a low IDF in the 'contents' field usually erases the effect of the payload.

I have found a fairly simple (albeit inelegant) solution that seems to work. The 'boost' field is still created as before, but it is only used to compute IDF values for the weight class 'BoostingTermQuery.BoostingTermWeight'. I had to make this class 'public' so that I could override the IDF value as follows:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.payloads.BoostingTermQuery;

    public class MNSBoostingTermQuery extends BoostingTermQuery {
      public MNSBoostingTermQuery(Term term) {
        super(term);
      }

      protected class MNSBoostingTermWeight extends BoostingTermQuery.BoostingTermWeight {
        public MNSBoostingTermWeight(BoostingTermQuery query, Searcher searcher) throws IOException {
          super(query, searcher);
          HashSet<Term> newTerms = new HashSet<Term>();
          // Recompute IDF based on 'boost' field: map each query term
          // onto the same text in the 'boost' field.
          Iterator i = terms.iterator();
          while (i.hasNext()) {
            Term term = (Term) i.next();
            newTerms.add(new Term("boost", term.text()));
          }
          // Replace the IDF computed by the superclass from the 'contents' field.
          this.idf = this.query.getSimilarity(searcher).idf(newTerms, searcher);
        }
      }
    }

Any thoughts about a better implementation are welcome.

Peter

On Thu, Nov 6, 2008 at 8:00 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
> Not sure, but it sounds like you are interested in a higher level Query, kind of like the BooleanQuery, but then part of it sounds like it is per document, right? [...]
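The IDF gap Peter describes can be illustrated with the formula used by Lucene's DefaultSimilarity in the 2.x line, idf = 1 + ln(numDocs / (docFreq + 1)). The sketch below is plain Java with no Lucene dependency; the class name `IdfGap` and the document counts are made-up illustration values, not figures from Peter's index.

```java
final class IdfGap {
    /** Same shape as DefaultSimilarity.idf(docFreq, numDocs) in Lucene 2.x. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 1000000;          // hypothetical index size
        int docFreqContents = 100000;   // hypothetical: term is common in 'contents'
        int docFreqBoost = 1000;        // hypothetical: term is rare in the smaller 'boost' field
        float contentsIdf = idf(docFreqContents, numDocs);
        float boostIdf = idf(docFreqBoost, numDocs);
        System.out.println("idf in contents: " + contentsIdf);
        System.out.println("idf in boost:    " + boostIdf);
        // As I understand the classic scoring formula, idf enters the score squared,
        // so the ratio that matters when comparing with a payload of 5 is this one.
        System.out.println("squared ratio:   " + (boostIdf * boostIdf) / (contentsIdf * contentsIdf));
    }
}
```

With a term that is far rarer in 'boost' than in 'contents', the squared IDF ratio can be of the same order as the payload factor, which is consistent with the observation that a payload of 5 does not reproduce the ranking of the two-field approach.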
Re: What does Sort.RELEVANCE do?
Section 5.1.2 of LIA also explains this.

Sort.RELEVANCE sorts by relevance score, descending, breaking ties by doc ID, ascending, and is the default if you don't specify a sort order.

Sort.INDEXORDER sorts only by doc ID, which is not the default sort.

Mike

Teruhiko Kurosaka wrote:
> I can specify Sort.RELEVANCE to Searcher.search as in: hits = searcher.search(q, Sort.RELEVANCE); [...]
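The two orderings Mike describes can be written as ordinary comparators. This is a plain-Java illustration of the ordering rules only, not Lucene's implementation; the `Hit` class and the comparator names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

final class RelevanceOrder {
    /** Hypothetical minimal hit: a document ID plus its relevance score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Score descending, ties broken by doc ID ascending (the rule given for Sort.RELEVANCE). */
    static final Comparator<Hit> RELEVANCE = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        }
    };

    /** Doc ID ascending only (the rule given for Sort.INDEXORDER). */
    static final Comparator<Hit> INDEX_ORDER = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            return Integer.compare(a.doc, b.doc);
        }
    };

    /** Sorts a copy of hits with the given order and returns the doc IDs in that order. */
    static int[] docsInOrder(Hit[] hits, Comparator<Hit> order) {
        Hit[] copy = hits.clone();
        Arrays.sort(copy, order);
        int[] docs = new int[copy.length];
        for (int i = 0; i < copy.length; i++) {
            docs[i] = copy[i].doc;
        }
        return docs;
    }
}
```

With hits that tie on score, the relevance ordering places the lower doc ID first, while the index ordering ignores the score entirely.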
Re: Global Field question (thread-safe)?
Thanks! :-)

2008/11/6 Michael McCandless [EMAIL PROTECTED]:
> The field never changes across all docs? If so, this will work fine.
>
> Mike
>
> Glen Newton wrote:
>> I have a use case where I want all of my documents to have - in addition to their other fields - a single field=value. [...]
Re: BoostingTermQuery scoring
I've discovered another flaw in using the 'boost' field technique:

    (+contents:petroleum +contents:engineer +contents:refinery) (+boost:petroleum +boost:engineer +boost:refinery)

It's possible that the first clause will produce a matching doc and none of the terms in the second clause are used to score that doc. Yet another reason to use BoostingTermQuery.

Peter

On Thu, Nov 6, 2008 at 1:08 PM, Peter Keegan [EMAIL PROTECTED] wrote:
> Let me give some background on the problem behind my question. [...]
RE: BoostingTermQuery scoring
Hi Peter,

On 11/06/2008 at 4:25 PM, Peter Keegan wrote:
> I've discovered another flaw in using this technique:
>
>     (+contents:petroleum +contents:engineer +contents:refinery) (+boost:petroleum +boost:engineer +boost:refinery)
>
> It's possible that the first clause will produce a matching doc and none of the terms in the second clause are used to score that doc. Yet another reason to use BoostingTermQuery.

I think you could address this, without BTQ, using something like:

    boost:(+petroleum +engineer +refinery)
    (+contents:(+petroleum +engineer +refinery)
     +((*:* -boost:petroleum)
       (*:* -boost:engineer)
       (*:* -boost:refinery)))

The last three lines give you the set of documents that are missing at least one of the terms in the boost field. The *:* thingy, indicating a MatchAllDocsQuery, is necessary to get all documents that don't have a given term; Lucene's (sub-)query document exclusion operation needs a non-empty set on which to operate.

On 11/06/2008 at 1:08 PM, Peter Keegan wrote:
> Then, at search time, a query for petroleum engineer gets rewritten to: (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer). Note that the two clauses are OR'd so that a term that exists in both fields will get a higher weight in the 'boost' field. This works quite well at boosting documents with terms that exist in the boosted fields. However, it doesn't work properly if excluded terms are added, for example:
>
>     (+contents:petroleum +contents:engineer -contents:drilling) (+boost:petroleum +boost:engineer -boost:drilling)
>
> If a document contains the term 'drilling' in the 'body' field, but not in the 'title' or 'city' field, a false hit occurs.

I think you could address this problem like this:

    +(boost:(+petroleum +engineer)
      (+contents:(+petroleum +engineer)
       +((*:* -boost:petroleum)
         (*:* -boost:engineer))))
    -contents:drilling

You don't have to include -boost:drilling, because this condition is entailed by -contents:drilling.

Steve
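The role of the `(*:* -boost:term)` clauses in Steve's query can be illustrated with ordinary set operations. The sketch below uses in-memory sets of document IDs instead of Lucene queries; the class `MissingBoostTerm` and its method are hypothetical names, and the posting sets in the usage note are made-up data.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class MissingBoostTerm {
    /**
     * Returns the documents that lack at least one of the terms, i.e. the union of
     * (allDocs minus postings) over all terms. allDocs plays the role of *:* and the
     * removal plays the role of -boost:term.
     */
    static Set<Integer> missingAny(Set<Integer> allDocs, List<Set<Integer>> postingsPerTerm) {
        Set<Integer> result = new TreeSet<Integer>();
        for (Set<Integer> postings : postingsPerTerm) {
            // A pure exclusion has nothing to subtract from, so start from every document.
            Set<Integer> withoutTerm = new TreeSet<Integer>(allDocs);
            withoutTerm.removeAll(postings);
            result.addAll(withoutTerm);
        }
        return result;
    }
}
```

For example, with documents 1 to 4, 'petroleum' in the boost field of documents 1 and 2, and 'engineer' in the boost field of documents 1 and 3, the result is documents 2, 3 and 4: only document 1 has both terms in the boost field, so only it is left to be scored by the first (boost) clause.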
Boosting results
I'm interested in comments on the following problem.

I have a set of documents. They fall into 3 categories. Call these categories A, B, and C. Each document has an indexed, non-tokenized field called "category" which contains A, B, or C (they are mutually exclusive categories). All of the documents contain a field called "body" which contains a bunch of text. This field is indexed and tokenized.

So, I want to do a search which looks something like:

    (category:A OR category:B) AND body:fred

I want all of the category A documents to come before the category B documents. Effectively, I want to have the category A documents first (sorted by relevancy) and then the category B documents after (sorted by relevancy).

I thought I could do this by boosting the category portion of the query, but that doesn't seem to work consistently. I was setting the boost on the category A term to 1.0 and the boost on the category B term to 0.0.

Any thoughts on how to skin this?

Scott
Re: Boosting results
It seems to me that the easiest thing would be to fire two queries and then just concatenate the results:

    category:A AND body:fred
    category:B AND body:fred

If you really, really didn't want to fire two queries, you could create filters on category A and category B and make a couple of passes through your results, checking whether the returned documents were in the filter, but you'd still concatenate the results. Actually, in your specific example you could make do with one filter on A.

You could also consider a custom scorer that adds 1,000,000 to the score of every category A document.

How much were you boosting by? What happens if you boost by a very large factor? As in ridiculously large?

Best
Erick

On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED] wrote:
> I'm interested in comments on the following problem. I have a set of documents. They fall into 3 categories. [...]
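Erick's "two queries, then concatenate" suggestion amounts to ordering by category first and by relevance within each category. A plain-Java sketch of that ordering follows; the `Hit` class, its fields, and the method name are hypothetical, and in real code each group would come from its own Lucene search rather than from filtering one list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class CategoryFirst {
    /** Hypothetical hit: an identifier, its category field value, and its relevance score. */
    static final class Hit {
        final String id;
        final String category;
        final float score;

        Hit(String id, String category, float score) {
            this.id = id;
            this.category = category;
            this.score = score;
        }
    }

    /** Returns hit IDs grouped in the given category order, each group by score descending. */
    static List<String> concatenate(List<Hit> hits, List<String> categoryOrder) {
        List<String> ids = new ArrayList<String>();
        for (String category : categoryOrder) {
            // Stands in for running "category:X AND body:fred" as its own query.
            List<Hit> group = new ArrayList<Hit>();
            for (Hit hit : hits) {
                if (hit.category.equals(category)) {
                    group.add(hit);
                }
            }
            Collections.sort(group, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) {
                    return Float.compare(b.score, a.score);
                }
            });
            for (Hit hit : group) {
                ids.add(hit.id);
            }
        }
        return ids;
    }
}
```

This guarantees that every category A document precedes every category B document. A boost on the category clause only shifts scores, so a strongly matching B document can still outrank a weakly matching A document unless the boost is large enough to dominate the body score.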