Re: [Nutch-dev] Re: PruneIndexTool

Michael Nebel Fri, 19 Nov 2004 11:41:38 -0800

Hi,

I think I found my mistake:

        content:"wordA wordB"

returns the two pages, and these are the results highlighted in the web frontend.

        content:"wordA wordB wordC"

returns the same results (without showing "wordC" on the result page). The first result has a score of more than 2000 and the second a score of 0.11!

ok - thanks!

        Michael


Michael Nebel wrote:

Hi Andrzej,
(I was too stupid to set the correct subject last time, so I am correcting it now.)
- using the webfrontend I have a query "wordA wordB wordC" which returns
  2 results with different URLs.
A very important thing is that PruneIndexTool uses a DIFFERENT syntax for queries than the Nutch web frontend. The syntax for the tool is Lucene QueryParser syntax - please see the javadoc comments for an example.
As far as I can tell, I used the Lucene query syntax, but this did not help either. Am I looking at the wrong page? http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

BTW: have you implemented a "not" query? I want to remove all documents matching 'lang:(NOT "en" and NOT "de")', but the output does not look right. I have French and Danish pages in my test system. Queries like
    lang:"fr"
or
    lang:"da"
or
    lang:"fr" or lang:"da"
show them.
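A rough sketch of why a query built only from NOT clauses can come back empty, assuming the behaviour described in the Lucene QueryParser documentation: NOT clauses only remove documents from a set selected by some other clause, and operators are recognised only in upper case. The function, the document names, and the `ALL` stand-in below are hypothetical and only model the idea in plain Python; this is not the Lucene or PruneIndexTool API.

```python
# Toy model of prohibited (NOT) clauses; not the Lucene API itself.

def search(docs, include=(), exclude=()):
    """Return names of docs selected by `include` and not hit by `exclude`.

    include: language values, at least one of which must match.
    exclude: language values written with NOT.
    """
    hits = []
    for name, lang in docs.items():
        selected = lang in include   # positive clauses pick the candidates
        rejected = lang in exclude   # NOT clauses only remove candidates
        if selected and not rejected:
            hits.append(name)
    return sorted(hits)


# Hypothetical test index: one page per language.
docs = {"p1": "en", "p2": "de", "p3": "fr", "p4": "da"}

# Stand-in for a positive clause that matches every document.
ALL = set(docs.values())

# Only NOT clauses: nothing was selected, so nothing is left to return.
only_not = search(docs, exclude={"en", "de"})

# A positive clause combined with the NOT clauses gives the intended set.
with_positive = search(docs, include=ALL, exclude={"en", "de"})
```

Under these assumptions, the purely negative form selects nothing, while pairing the NOT clauses with a positive clause (for example the explicit lang:"fr" and lang:"da" terms above) leaves the French and Danish pages.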
- Then I tried to remove the pages using PruneIndex:
    content: "wordA wordB wordC"
First of all, there must be no space between the field name, colon, and the query term. I assume it's just a transcription error, and not the real query...
I had no spaces, but I just tried it: with spaces the result is the same ?-0
Checking the fields, the result really is the same with and without spaces.
Anyway, this query means that you want to match all documents, which contain "wordA wordB wordC" as an exact phrase in the content field. Probably not what you wanted... you probably wanted something like:
No - I want to delete the page that contains all three words. The web frontend shows two different results (in fact it's a special test case I built :-).
Concerning the example of your original posting:
    content:wordA +url:"abc"
returns the same results as
    url:"abc"
The wordA clause makes no difference.
    +content:"wordA" +url:"abc"
works as expected.
I think I'm making a mistake, but which one?
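One way to read the difference between the two queries, assuming the usual Lucene rule that a clause prefixed with "+" is required while an unprefixed clause is optional: once a required clause is present, an optional clause no longer filters anything and only influences the score. The sketch below models that rule in plain Python. The documents and helper names are hypothetical, and a field is simplified to a set of terms; it is not the Lucene API.

```python
# Toy model of Lucene-style required/optional clauses; not the Lucene API.
# A document maps a field name to the set of terms it contains.

def matches(doc, required=(), optional=()):
    """Return True if doc satisfies the clauses.

    required: (field, term) pairs written with '+'; all must match.
    optional: (field, term) pairs written without a prefix.
    """
    def hit(clause):
        field, term = clause
        return term in doc.get(field, set())

    if required:
        # With at least one required clause, optional clauses do not
        # filter; in Lucene they would only contribute to the score.
        return all(hit(c) for c in required)
    # With optional clauses only, at least one of them has to match.
    return any(hit(c) for c in optional)


# Hypothetical test index.
docs = {
    "page1": {"content": {"wordA", "wordB"}, "url": {"abc"}},
    "page2": {"content": {"wordB"}, "url": {"abc"}},
    "page3": {"content": {"wordA"}, "url": {"xyz"}},
}


def search(required=(), optional=()):
    return sorted(name for name, doc in docs.items()
                  if matches(doc, required, optional))


# content:wordA +url:"abc"    -> same hits as url:"abc" alone
mixed = search(required=[("url", "abc")], optional=[("content", "wordA")])

# +content:"wordA" +url:"abc" -> both clauses must match
strict = search(required=[("content", "wordA"), ("url", "abc")])
```

In this model the mixed query returns every page whose url matches, whether or not it contains wordA, which is consistent with the observation above; only the form with "+" on both clauses narrows the result.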
Regards
    Michael
PS.: But the tool itself is a great idea! Thanks!
-------------------------------------------------------
This SF.Net email is sponsored by: InterSystems CACHE
FREE OODBMS DOWNLOAD - A multidimensional database that combines
robust object and relational technologies, making it a perfect match
for Java, C++,COM, XML, ODBC and JDBC. www.intersystems.com/match8
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

