Daniel,

The test case is now attached as Bug #32674. It is commented with lines from the email below to make the correspondence easy to follow.

Please let me know your thoughts,
Chuck

> -----Original Message-----
> From: Chuck Williams [mailto:[EMAIL PROTECTED]
> Sent: Sunday, December 12, 2004 11:23 AM
> To: Lucene Developers List
> Subject: RE: Boolean Scorer
>
> Daniel,
>
> A perfectly reasonable request -- I'll put together a simple test case
> but can't do it today.
>
> The problem is with scoring -- nothing to do with AND queries.
>
> The test will run along these lines:
>   1. Use a custom similarity to eliminate all tf and idf effects, just
>      to isolate what is being tested.
>   2. Create two documents doc1 and doc2, each with two fields title and
>      description. doc1 has "elephant" in title and "elephant" in
>      description. doc2 has "elephant" in title and "albino" in
>      description.
>   3. Express query for "albino elephant" against both fields.
>      Problems:
>        a. MultiFieldQueryParser won't recognize either document as
>           containing both terms, due to the way it expands the query
>           across fields.
>        b. Expressing query as "title:albino description:albino
>           title:elephant description:elephant" will score both documents
>           equivalently, since each matches two query terms.
>   4. Comparison to MaxDisjunctionQuery and my method for expanding
>      queries across fields. Using notation that () represents a
>      BooleanQuery and {} represents a MaxDisjunctionQuery, "albino
>      elephant" expands to:
>        ( {title:albino description:albino}
>          {title:elephant description:elephant} )
>      This will recognize that doc2 has both terms matched while doc1 only
>      has 1 term matched, and score doc2 over doc1.
>
> Refinement note: the actual expansion for "albino elephant" that I use is:
>   ( {title:albino description:albino}~0.1
>     {title:elephant description:elephant}~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of the
> highest-scoring MDQ subclause plus 0.1 times the sum of the scores of
> the other MDQ subclauses.
> Thus, doc1 gets some credit for also having "elephant" in the
> description but only 1/10 as much as doc2 gets for covering another
> query term in its description. If doc3 has "elephant" in title and both
> "albino" and "elephant" in the description, then with the actual refined
> expansion, it gets the highest score of all (whereas with pure max,
> without the 0.1, it would get the same score as doc2).
>
> In real apps, tf's and idf's also come into play of course, but can
> affect these either way (i.e., mitigate this fundamental problem or
> exacerbate it).
>
> Chuck
>
> > -----Original Message-----
> > From: Daniel Naber [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, December 12, 2004 2:24 AM
> > To: Lucene Developers List
> > Subject: Re: Boolean Scorer
> >
> > On Sunday 12 December 2004 04:01, Chuck Williams wrote:
> >
> > > I maintain the belief that max is *required* to implement reasonable
> > > multi-field searching (1).
> >
> > Could you give a small example -- preferably a test case -- that shows
> > what the problem is? I know it has been discussed before but I hadn't
> > been able to follow that discussion closely enough. I assume the problem
> > is in the scoring, not in MultiFieldQueryParser. MultiFieldQueryParser
> > has a different problem, namely that it doesn't correctly work with AND
> > queries. Or is that the issue you're talking about? Anyway, that will be
> > fixed soon.
> >
> > Regards
> >  Daniel
> >
> > --
> > http://www.danielnaber.de
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
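
The arithmetic in the quoted message (problem 3b, the expansion in step 4, and the ~0.1 refinement) can be sketched with a small toy model. This is plain Python, not Lucene code, and it is not the test case attached to the bug. It assumes every matching field:term clause scores exactly 1.0 (the effect of the custom similarity in step 1) and ignores coord and query normalization. The names DOCS, clause_score, flat_boolean_score and max_disjunction_score are hypothetical, chosen only for illustration.

```python
# Toy model of the scoring discussed above; NOT Lucene code.
# Assumptions: each matching field:term clause scores 1.0 (tf/idf removed),
# and BooleanQuery coord/normalization factors are ignored.

DOCS = {
    "doc1": {"title": {"elephant"}, "description": {"elephant"}},
    "doc2": {"title": {"elephant"}, "description": {"albino"}},
    "doc3": {"title": {"elephant"}, "description": {"albino", "elephant"}},
}
FIELDS = ("title", "description")
TERMS = ("albino", "elephant")


def clause_score(doc, field, term):
    """Score of one field:term clause: 1.0 on a match, otherwise 0.0."""
    return 1.0 if term in doc[field] else 0.0


def flat_boolean_score(doc):
    """Sum of all four field:term clauses, as in the flat query of 3b."""
    return sum(clause_score(doc, f, t) for t in TERMS for f in FIELDS)


def max_disjunction_score(doc, tie_breaker=0.0):
    """Sum over query terms of: best field score for that term,
    plus tie_breaker times the scores of the remaining fields."""
    total = 0.0
    for term in TERMS:
        scores = [clause_score(doc, f, term) for f in FIELDS]
        best = max(scores)
        total += best + tie_breaker * (sum(scores) - best)
    return total


if __name__ == "__main__":
    for name, doc in DOCS.items():
        print(
            name,
            "flat:", flat_boolean_score(doc),
            "max:", max_disjunction_score(doc),
            "max~0.1:", round(max_disjunction_score(doc, 0.1), 3),
        )
```

Under these assumptions the flat sum cannot separate doc1 from doc2, pure max ranks doc2 above doc1 but ties doc2 with doc3, and the 0.1 tie-breaker orders them doc3, doc2, doc1, which matches the behavior described in the quoted message.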