[jira] Commented: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected

Steven Parkes (JIRA) Thu, 31 Aug 2006 10:13:54 -0700

    [ 
http://issues.apache.org/jira/browse/LUCENE-666?page=comments#action_12431921 ] 
            
Steven Parkes commented on LUCENE-666:
--------------------------------------


I think the idea of a parse exception is good. The issue is relatively subtle.  
A user can relatively easily generate a query that ignores terms. Doing that 
silently is just asking for people to spend a lot of time trying to understand 
why they aren't getting something they expect.

The question is whether it's feasible to detect that an invalid query has been 
entered. The problem is of course having to ever enumerate all documents. But 
with multilevel queries (sums of products of sums, etc.), it's not immediately 
obvious from the parse tree whether that is  being attempted or not.

a(b  + !c) is okay but (a+!d)(b + !c) isn't. Basically you can't do a 
term-by-term analysis of a multilevel query and decide. It's a global property 
of the query.

It one reduces the multilevel query to a two level sum-of-products form (i.e., 
distribute all AND operations over all OR operaitons), you can inspect the 
result. If any term does not contain a positive term, the query is invalid. But 
going from multilevel to two level can generate an exponential number of terms. 
Maybe not an issue in IR, but it makes me nervous.

I would think that somewhere between query parsing and search execution, the 
invalidity of a query would have to pop out but I haven't delved into this path 
enough to know where that might happen.

If no one's looking at this (generating an exception somewhere) and no one 
knows of any reason it's not possible, I'll look ...

On a related topic, and maybe a naive question, has anyone considered keeping a 
posting that contains all documents? Is it really that hard to do in Lucene 
today? That would remove this restriction. Perhaps the issue then is that the 
combination of the number of valid use cases combined with the number of times 
it would cause people trouble because they were using it invalidly means it's 
not a good idea? I can come up with some potentially interesting queries, 
especially when you combine with other features like facets. "Tell me the facet 
breakdown of all documents that don't contain the word microsoft ..." In many 
places, !microsoft is not such a terribly big set and can be computed with 
about the same effort that microsoft can. (Is that true?)

> TERM1 OR NOT TERM2 does not perform as expected
> -----------------------------------------------
>
>                 Key: LUCENE-666
>                 URL: http://issues.apache.org/jira/browse/LUCENE-666
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 2.0.0
>         Environment: Windows XP, JavaCC 4.0, JDK 1.5
>            Reporter: Dejan Nenov
>         Attachments: TestAornotB.java
>
>
> test:
>     [junit] Testsuite: org.apache.lucene.search.TestAornotB
>     [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0.39 sec
>     [junit] ------------- Standard Output ---------------
>     [junit] Doc1 = A B C
>     [junit] Doc2 = A B C D
>     [junit] Doc3 = A   C D
>     [junit] Doc4 =   B C D
>     [junit] Doc5 =     C D
>     [junit] -------------------------------------------------
>     [junit] With query "A OR NOT B" we expect to hit
>     [junit] all documents EXCEPT Doc4, instead we only match on Doc3.
>     [junit] While LUCENE currently explicitly does not support queries of
>     [junit] the type "find docs that do not contain TERM" - this explains
>     [junit] not finding Doc5, but does not justify elimnating Doc1 and Doc2
>     [junit] -------------------------------------------------
>     [junit]  the fix shoould likely require a modification to QueryParser.jj
>     [junit]  around the method:
>     [junit]  protected void addClause(Vector clauses, int conj, int mods, 
> Query q)
>     [junit] Query:c:a -c:b hits.length=1
>     [junit] Query Found:Doc[0]= A C D
>     [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of 
> required/prohibited clause(s)
>     [junit]   0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
>     [junit]     1.0 = tf(termFreq(c:a)=1)
>     [junit]     1.2231436 = idf(docFreq=3)
>     [junit]     0.5 = fieldNorm(field=c, doc=1)
>     [junit]   0.0 = match on prohibited clause (c:b)
>     [junit]     0.6115718 = (MATCH) fieldWeight(c:b in 1), product of:
>     [junit]       1.0 = tf(termFreq(c:b)=1)
>     [junit]       1.2231436 = idf(docFreq=3)
>     [junit]       0.5 = fieldNorm(field=c, doc=1)
>     [junit] 0.6115718 = (MATCH) sum of:
>     [junit]   0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
>     [junit]     1.0 = tf(termFreq(c:a)=1)
>     [junit]     1.2231436 = idf(docFreq=3)
>     [junit]     0.5 = fieldNorm(field=c, doc=2)
>     [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of 
> required/prohibited clause(s)
>     [junit]   0.0 = match on prohibited clause (c:b)
>     [junit]     0.6115718 = (MATCH) fieldWeight(c:b in 3), product of:
>     [junit]       1.0 = tf(termFreq(c:b)=1)
>     [junit]       1.2231436 = idf(docFreq=3)
>     [junit]       0.5 = fieldNorm(field=c, doc=3)
>     [junit] Query:c:a (-c:b) hits.length=3
>     [junit] Query Found:Doc[0]= A B C
>     [junit] Query Found:Doc[1]= A B C D
>     [junit] Query Found:Doc[2]= A C D
>     [junit] 0.3057859 = (MATCH) product of:
>     [junit]   0.6115718 = (MATCH) sum of:
>     [junit]     0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
>     [junit]       1.0 = tf(termFreq(c:a)=1)
>     [junit]       1.2231436 = idf(docFreq=3)
>     [junit]       0.5 = fieldNorm(field=c, doc=1)
>     [junit]   0.5 = coord(1/2)
>     [junit] 0.3057859 = (MATCH) product of:
>     [junit]   0.6115718 = (MATCH) sum of:
>     [junit]     0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
>     [junit]       1.0 = tf(termFreq(c:a)=1)
>     [junit]       1.2231436 = idf(docFreq=3)
>     [junit]       0.5 = fieldNorm(field=c, doc=2)
>     [junit]   0.5 = coord(1/2)
>     [junit] 0.0 = (NON-MATCH) product of:
>     [junit]   0.0 = (NON-MATCH) sum of:
>     [junit]   0.0 = coord(0/2)
>     [junit] ------------- ---------------- ---------------
>     [junit] Testcase: testFAIL(org.apache.lucene.search.TestAornotB):   FAILED
>     [junit] resultDocs =A C D expected:<3> but was:<1>
>     [junit] junit.framework.AssertionFailedError: resultDocs =A C D 
> expected:<3> but was:<1>
>     [junit]     at 
> org.apache.lucene.search.TestAornotB.testFAIL(TestAornotB.java:137)
>     [junit] Test org.apache.lucene.search.TestAornotB FAILED

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected

Reply via email to