Byron Miller wrote:
I have confirmed that this is fixed, however the
results are pretty irrelevant to C++ :)

I'll have to dig further hehe

This is a tokenizer issue. Searches for "C++" are interpreted as searches for "c".


To see how queries are tokenized, parsed, and translated, try:

% bin/nutch net.nutch.searcher.Query

Queries are first tokenized, using the same tokenizer as is used for documents. Query tokens are next parsed into instances of net.nutch.searcher.Query, interpreting plusses, minuses and quotes. The parsed query is displayed using Nutch's query syntax, described at http://www.nutch.org/docs/en/help.html.

Finally, Nutch queries are translated into Lucene queries (instances of org.apache.lucene.search.Query). The translated query is displayed using Lucene's query syntax, described at http://jakarta.apache.org/lucene/docs/queryparsersyntax.html.

The tokenizer could be altered to better handle "C++". The file to edit is NutchAnalysis.jj. The 'generate-src' ant target runs JavaCC, to generate NutchAnalysis.java. JavaCC documentation can be found at https://javacc.dev.java.net/.
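
To make the "C++" problem concrete, here is a minimal sketch in plain Java. It is not the code generated from NutchAnalysis.jj: the class TokenizerSketch, the method tokenize, and the keepTrailingPlus flag are all hypothetical names made up for this illustration. It assumes a tokenizer that treats every character other than a letter or digit as a separator, which is enough to show why "C++" collapses to "c", and what a rule that keeps trailing plus signs might look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT Nutch's NutchAnalysis tokenizer.
class TokenizerSketch {

    // Splits text into lower-cased runs of letters and digits.
    // When keepTrailingPlus is true, '+' characters that directly follow
    // a word are kept as part of that token (an assumed rule, not Nutch's).
    static List<String> tokenize(String text, boolean keepTrailingPlus) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // punctuation and whitespace only separate tokens
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (keepTrailingPlus) {
                // Extend the token over any '+' glued to the end of the word.
                while (i < text.length() && text.charAt(i) == '+') {
                    i++;
                }
            }
            tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++ tutorial", false)); // [c, tutorial]
        System.out.println(tokenize("C++ tutorial", true));  // [c++, tutorial]
    }
}
```

In this sketch a leading plus, as in "+nutch", is still dropped by the tokenizer, so it stays available to the query parser as an operator. A plus in the middle of text, as in "a+b", would become the tokens "a+" and "b", so a real change to the grammar would need to weigh that side effect, and documents would have to be re-indexed with the same rule for "c++" to match.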

Doug


_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
