[ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12361003 ]
Otis Gospodnetic commented on LUCENE-323:
-----------------------------------------

Yonik - you committed this? The case is still showing as open, so you may want to close it if you're done with it.

> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate
> support for queries across multiple fields
> -----------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
>               Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java,
>               TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java,
>               TestRanking.zip, TestRanking.zip, TestRanking.zip,
>               WikipediaSimilarity.java, WikipediaSimilarity.java,
>               WikipediaSimilarity.java, dms.tar.gz
>
> The attached test case demonstrates this problem and provides a fix:
>   1. Use a custom similarity to eliminate all tf and idf effects, just to
>      isolate what is being tested.
>   2. Create two documents doc1 and doc2, each with two fields title and
>      description. doc1 has "elephant" in title and "elephant" in description.
>      doc2 has "elephant" in title and "albino" in description.
>   3. Express query for "albino elephant" against both fields.
>      Problems:
>        a. MultiFieldQueryParser won't recognize either document as containing
>           both terms, due to the way it expands the query across fields.
>        b. Expressing query as "title:albino description:albino title:elephant
>           description:elephant" will score both documents equivalently, since
>           each matches two query terms.
>   4. Comparison to MaxDisjunctionQuery and my method for expanding queries
>      across fields.
>      Using notation that () represents a BooleanQuery and ( | )
>      represents a MaxDisjunctionQuery, "albino elephant" expands to:
>        ( (title:albino | description:albino)
>          (title:elephant | description:elephant) )
>      This will recognize that doc2 has both terms matched while doc1 has only
>      1 term matched, and so score doc2 over doc1.
>      Refinement note: the actual expansion for "albino elephant" that I use is:
>        ( (title:albino | description:albino)~0.1
>          (title:elephant | description:elephant)~0.1 )
>      This causes the score of each MaxDisjunctionQuery to be the score of the
>      highest scoring MDQ subclause plus 0.1 times the sum of the scores of the
>      other MDQ subclauses. Thus, doc1 gets some credit for also having
>      "elephant" in the description, but only 1/10 as much as doc2 gets for
>      covering another query term in its description. If doc3 has "elephant"
>      in title and both "albino" and "elephant" in the description, then with
>      the actual refined expansion it gets the highest score of all (whereas
>      with pure max, without the 0.1, it would get the same score as doc2).
>      In real apps, tf's and idf's also come into play of course, but can
>      affect these either way (i.e., mitigate this fundamental problem or
>      exacerbate it).
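
The arithmetic in the description above can be checked with a small stand-alone sketch. This is plain Java, not Lucene code and not the attached patch: the class name `DisMaxSketch`, its method names, and the scoring model are assumptions made for illustration. Every matching field:term clause is assumed to score exactly 1.0, which mimics the custom similarity that removes tf and idf effects; coord and length norms are ignored as well.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical, not Lucene): each matching field:term clause scores 1.0. */
final class DisMaxSketch {
    static final List<String> FIELDS = List.of("title", "description");
    static final List<String> TERMS = List.of("albino", "elephant");

    // The three documents from the issue description: field name -> terms in that field.
    static final Map<String, Set<String>> DOC1 = Map.of(
            "title", Set.of("elephant"),
            "description", Set.of("elephant"));
    static final Map<String, Set<String>> DOC2 = Map.of(
            "title", Set.of("elephant"),
            "description", Set.of("albino"));
    static final Map<String, Set<String>> DOC3 = Map.of(
            "title", Set.of("elephant"),
            "description", Set.of("albino", "elephant"));

    /** 1.0 if the field contains the term, else 0.0 (assumed flat similarity). */
    static double clauseScore(Map<String, Set<String>> doc, String field, String term) {
        return doc.getOrDefault(field, Set.of()).contains(term) ? 1.0 : 0.0;
    }

    /** Flat expansion: sum over every field:term clause, as in problem (b). */
    static double flatBooleanScore(Map<String, Set<String>> doc,
                                   List<String> fields, List<String> terms) {
        double sum = 0.0;
        for (String term : terms) {
            for (String field : fields) {
                sum += clauseScore(doc, field, term);
            }
        }
        return sum;
    }

    /** One ( f1:term | f2:term )~tie group: max clause plus tie times the other clauses. */
    static double disMaxScore(Map<String, Set<String>> doc,
                              List<String> fields, String term, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (String field : fields) {
            double s = clauseScore(doc, field, term);
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    /** Outer boolean query: sum of one max-disjunction group per query term. */
    static double expandedScore(Map<String, Set<String>> doc,
                                List<String> fields, List<String> terms, double tieBreaker) {
        double total = 0.0;
        for (String term : terms) {
            total += disMaxScore(doc, fields, term, tieBreaker);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = List.of("doc1", "doc2", "doc3");
        List<Map<String, Set<String>>> docs = List.of(DOC1, DOC2, DOC3);
        for (int i = 0; i < docs.size(); i++) {
            Map<String, Set<String>> d = docs.get(i);
            System.out.printf("%s flat=%.1f pureMax=%.1f tie0.1=%.1f%n",
                    names.get(i),
                    flatBooleanScore(d, FIELDS, TERMS),
                    expandedScore(d, FIELDS, TERMS, 0.0),
                    expandedScore(d, FIELDS, TERMS, 0.1));
        }
    }
}
```

Under these assumptions the flat expansion gives doc1 and doc2 the same score (2.0 each), pure max gives doc1 1.0 and both doc2 and doc3 2.0, and the 0.1 tie-breaker gives roughly 1.1, 2.0 and 2.1 for doc1, doc2 and doc3, which is the ordering the description argues for.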