Re: [Nutch-dev] Re: PruneIndexTool

Andrzej Bialecki Fri, 19 Nov 2004 11:50:13 -0800

Michael Nebel wrote:

As far as I checked, I got the Lucene query syntax, but this did not help either. Have I gotten the wrong page? http://jakarta.apache.org/lucene/docs/queryparsersyntax.html


Yes, that's the right page - however, read on...

BTW: have you implemented a "not"-query? I want to remove all documents with 'lang:(NOT "en" and NOT "de")', but the output does not look OK. I have

Lucene does not implement a pure NOT query - it only implements an intersection of some other query and a NOT query.

There are good reasons (most of them performance-related) why implementing a pure NOT query is difficult/undesirable. If you want, you can achieve a similar effect by using a token that exists in all documents and intersecting it with a NOT query, e.g. (with wordA standing in for such a token):

+content:wordA AND NOT lang:(de OR en)
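
To illustrate the point, here is a minimal sketch in Python of why a NOT clause needs a positive clause to subtract from. It is a toy model, not Lucene's API: the documents, the boolean_query helper and its parameters are all hypothetical, and wordA is assumed to occur in every document.

```python
# Toy model of boolean matching; NOT Lucene's API.
# Each document maps a field name to the set of tokens it contains.
docs = {
    1: {"content": {"wordA", "hello"}, "lang": {"en"}},
    2: {"content": {"wordA", "bonjour"}, "lang": {"fr"}},
    3: {"content": {"wordA", "hej"}, "lang": {"da"}},
    4: {"content": {"wordA", "hallo"}, "lang": {"de"}},
}


def has_term(doc, clause):
    """True if the document contains the (field, token) clause."""
    field, token = clause
    return token in doc.get(field, set())


def boolean_query(docs, required=(), prohibited=()):
    """Ids of documents matching every required and no prohibited clause.

    Assumption of this sketch: prohibited clauses only filter the candidates
    selected by the positive clauses, so a query with no positive clause
    selects nothing.
    """
    if not required:
        return set()
    return {
        doc_id
        for doc_id, doc in docs.items()
        if all(has_term(doc, c) for c in required)
        and not any(has_term(doc, c) for c in prohibited)
    }


not_de_or_en = [("lang", "de"), ("lang", "en")]

# Pure NOT: nothing to subtract from, so nothing is returned.
pure_not = boolean_query(docs, prohibited=not_de_or_en)

# Anchored on a token present in all documents: the NOT now has an effect.
anchored = boolean_query(docs, required=[("content", "wordA")],
                         prohibited=not_de_or_en)

print(sorted(pure_not))  # []
print(sorted(anchored))  # [2, 3]
```

In this toy data set the anchored query selects only the French and Danish documents, which is the set the pruning query is meant to hit.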

French and Danish pages in my test system. Queries like

    lang:"fr"
or
    lang:"da"

or

    lang:"fr" or lang:"da"

show them.


Yes, that's expected.

- Then I tried to remove the pages using PruneIndex:
    content: "wordA wordB wordC"
First of all, there must be no space between the field name, colon, and the query term. I assume it's just a transcription error, and not the real query...
I had no spaces, but I just tried: with spaces it's the same result ?-0
Checking the fields it's really the same result with and without.
Anyway, this query means that you want to match all documents that contain "wordA wordB wordC" as an exact phrase in the content field. Probably not what you wanted... you probably wanted something like:
no - I want to delete the page with all three words in it. The webfrontend shows two different results (in fact it's a special testcase I built :-).

The webfrontend uses a different query syntax. Eventually somewhere down in the pipeline this query is translated into a Lucene query, but not in the way you would expect it...

Please use the (explain) page in the web application to see why and how the pages got their score - if possible, please send me this output (you can do it off-the-list to my address ab at getopt dot org, if it's sensitive). I suspect that the matches you get from the web application don't match the whole exact phrase, but just match some of the query terms.


Concerning the example of your original posting:

    content:wordA +url:"abc"

returns the same results as

    url:"abc"

That's correct - the first query term is optional, because it's not preceded by a "+". So, this query also matches documents that don't contain content:wordA, but if they do contain it they get a higher score...


The wordA makes no difference.

    +content:"wordA" +url:"abc"

works as expected.

Yes, because now you require that both terms must be present in the same document. If either of them is missing, you get a zero score.
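
The difference between the optional and the required form can be sketched with a small Python toy model. This is not Lucene's API or its real scoring formula: the documents, the search helper and the "one point per matching clause" score are assumptions made purely for illustration.

```python
# Toy model of required ("+") versus optional clauses; NOT Lucene's API.
docs = {
    1: {"content": {"wordA"}, "url": {"abc"}},
    2: {"content": {"other"}, "url": {"abc"}},
    3: {"content": {"wordA"}, "url": {"xyz"}},
}


def has_term(doc, clause):
    """True if the document contains the (field, token) clause."""
    field, token = clause
    return token in doc.get(field, set())


def search(docs, required=(), optional=()):
    """Return {doc_id: score} for the matching documents.

    Required clauses decide whether a document matches at all; optional
    clauses only add to the (hypothetical) score of one point per clause.
    """
    hits = {}
    for doc_id, doc in docs.items():
        if not all(has_term(doc, c) for c in required):
            continue  # a missing required clause excludes the document
        score = len(required) + sum(has_term(doc, c) for c in optional)
        if score > 0:
            hits[doc_id] = score
    return hits


# content:wordA +url:"abc"   (first clause optional)
mixed = search(docs, required=[("url", "abc")], optional=[("content", "wordA")])

# url:"abc"
url_only = search(docs, required=[("url", "abc")])

# +content:"wordA" +url:"abc"   (both clauses required)
both = search(docs, required=[("content", "wordA"), ("url", "abc")])

print(mixed)     # {1: 2, 2: 1}
print(url_only)  # {1: 1, 2: 1}
print(both)      # {1: 2}
```

The first two queries return the same set of documents and differ only in ranking, while the third drops the document that lacks wordA.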

I think I'm making a mistake, but which?

Well, I'd say that this whole issue is a bit tricky, because of the complex interactions between Nutch and Lucene query syntaxes... Perhaps I should add an option to the tool to use Nutch syntax, so that it matches exactly the same documents as in the web application.

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
