[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields

Chuck Williams (JIRA) Mon, 14 Nov 2005 14:55:49 -0800

    [ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12357635 ]


Chuck Williams commented on LUCENE-323:
---------------------------------------

The code uses bubble sort only for the incremental resorting of an 
already-sorted list.  The initial sort is done with Arrays.sort(), which is 
O(n*log n).  The incremental resort is O(k*n), where k is the number of 
clauses that matched the document last generated.  Even if n is large, k will 
usually be small.  Theoretically this is O(n^2) because k could be as high as 
n, but that is extremely unlikely, especially when n is large.  More likely, 
k is bounded by a small constant, in which case the algorithm is O(n).  It's 
like Quicksort in that regard: there are outlier cases where it won't perform 
well, but it will perform better than most alternatives in the vast majority 
of cases.
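
To make the cost argument concrete, here is a minimal sketch of an 
incremental resort of this kind.  It is not the code from the patch: the 
class and method names are hypothetical, and it assumes the clauses' current 
doc ids are held in an int array sorted ascending, with only the first k 
entries having just advanced.

```java
import java.util.Arrays;

public class IncrementalResort {
    // Restores ascending order after only the first k entries have changed.
    // Assumes docs[k..n-1] is already sorted. Worst case is O(k*n).
    public static void resortPrefix(int[] docs, int k) {
        for (int i = k - 1; i >= 0; i--) {
            // Bubble docs[i] to the right until it is no larger than its successor.
            for (int j = i; j + 1 < docs.length && docs[j] > docs[j + 1]; j++) {
                int tmp = docs[j];
                docs[j] = docs[j + 1];
                docs[j + 1] = tmp;
            }
        }
    }

    public static void main(String[] args) {
        int[] docs = {5, 5, 7, 9, 12};  // sorted; the first two clauses matched doc 5
        docs[0] = 8;                    // each matching clause advances to its next doc
        docs[1] = 20;
        resortPrefix(docs, 2);
        System.out.println(Arrays.toString(docs));
    }
}
```

Each of the k changed entries moves at most n positions, which is where the 
O(k*n) bound comes from; when k is a small constant, the pass is linear in n.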

Resorting the whole list every time would perform worse.  The best algorithm 
would probably be to use the standard insert and delete operations on a heap 
(as in heap sort):

    while top element generated last doc
        heap remove it
        generate it
        heap insert it

This would yield O(k*log n) time per generated document, as with a 
PriorityQueue.
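
The loop above could be sketched roughly as follows with 
java.util.PriorityQueue.  This is only an illustration under stated 
assumptions: Cursor is a hypothetical stand-in for a sub-scorer that walks a 
sorted array of doc ids, and union() is a made-up helper, not part of the 
attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class HeapMerge {
    // Hypothetical stand-in for a sub-scorer: walks a sorted array of doc ids.
    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
        boolean next() { return ++pos < docs.length; }
    }

    // Returns the union of all doc ids in ascending order, each reported once.
    public static List<Integer> union(int[][] postings) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] p : postings) {
            if (p.length > 0) heap.add(new Cursor(p));
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int current = heap.peek().doc();
            result.add(current);
            // While the top element generated the current doc:
            // remove it, advance it, and reinsert it if it has more docs.
            while (!heap.isEmpty() && heap.peek().doc() == current) {
                Cursor top = heap.poll();
                if (top.next()) heap.add(top);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 4, 9}, {1, 2, 9}, {3, 4} };
        System.out.println(union(postings));
    }
}
```

A cursor is only advanced while it is outside the heap, so the heap ordering 
is never invalidated by a changing key.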

I don't think this is much of an issue to worry about, but the algorithm could 
be revised to use the heap sort operations if others think it is important.

Chuck


> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate 
> support for queries across multiple fields
> -----------------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
> Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java, 
> TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java, TestRanking.zip, 
> TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java, 
> WikipediaSimilarity.java, WikipediaSimilarity.java
>
> The attached test case demonstrates this problem and provides a fix:
>   1.  Use a custom similarity to eliminate all tf and idf effects, just to 
> isolate what is being tested.
>   2.  Create two documents doc1 and doc2, each with two fields title and 
> description.  doc1 has "elephant" in title and "elephant" in description.  
> doc2 has "elephant" in title and "albino" in description.
>   3.  Express query for "albino elephant" against both fields.
> Problems:
>       a.  MultiFieldQueryParser won't recognize either document as containing 
> both terms, due to the way it expands the query across fields.
>       b.  Expressing query as "title:albino description:albino title:elephant 
> description:elephant" will score both documents equivalently, since each 
> matches two query terms.
>   4.  Comparison to MaxDisjunctionQuery and my method for expanding queries 
> across fields.  Using notation that () represents a BooleanQuery and ( | ) 
> represents a MaxDisjunctionQuery, "albino elephant" expands to:
>         ( (title:albino | description:albino)
>           (title:elephant | description:elephant) )
> This will recognize that doc2 has both terms matched while doc1 has only 1 
> term matched, and so score doc2 over doc1.
> Refinement note:  the actual expansion for "albino elephant" that I use is:
>         ( (title:albino | description:albino)~0.1
>           (title:elephant | description:elephant)~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of highest 
> scoring MDQ subclause plus 0.1 times the sum of the scores of the other MDQ 
> subclauses.  Thus, doc1 gets some credit for also having "elephant" in the 
> description but only 1/10 as much as doc2 gets for covering another query 
> term in its description.  If doc3 has "elephant" in title and both "albino" 
> and "elephant" in the description, then with the actual refined expansion, it 
> gets the highest score of all (whereas with pure max, without the 0.1, it 
> would get the same score as doc2).
> In real apps, tf's and idf's also come into play of course, but can affect 
> these either way (i.e., mitigate this fundamental problem or exacerbate it).
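
The scoring rule in the quoted description can be checked with a small 
arithmetic sketch.  It assumes, as in step 1 of the test case, that tf and 
idf effects are eliminated, so a field that contains a term scores 1.0 and 
one that does not scores 0.0.  It also assumes the enclosing BooleanQuery 
simply adds its clauses (coord and norms are ignored).  The class and method 
names are hypothetical and are not the API of the attached classes.

```java
public class DisMaxSketch {
    // Score of one ( a | b | ... )~tieBreaker group: the best subclause score
    // plus tieBreaker times the sum of the remaining subclause scores.
    // Assumes subclause scores are non-negative.
    public static double disMax(double tieBreaker, double... subScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    // Score for "albino elephant": the sum of the two per-term groups.
    public static double score(double t,
                               double titleAlbino, double descAlbino,
                               double titleElephant, double descElephant) {
        return disMax(t, titleAlbino, descAlbino)
             + disMax(t, titleElephant, descElephant);
    }

    public static void main(String[] args) {
        double t = 0.1;
        // doc1: "elephant" in title and description
        System.out.println("doc1 " + score(t, 0, 0, 1, 1));
        // doc2: "elephant" in title, "albino" in description
        System.out.println("doc2 " + score(t, 0, 1, 1, 0));
        // doc3: "elephant" in title, "albino" and "elephant" in description
        System.out.println("doc3 " + score(t, 0, 1, 1, 1));
    }
}
```

Under these assumptions doc1 scores about 1.1, doc2 scores 2.0, and doc3 
about 2.1, so the ordering is doc3, doc2, doc1.  With a tie-breaker of 0, 
doc2 and doc3 score the same, as the description says.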

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


