Re: NUTCH-479 "Support for OR queries" - what is this about

Andrzej Bialecki Sat, 07 Jul 2007 13:27:06 -0700

Briggs wrote:

Please keep this thread going as I am also curious to know why this
has been 'forked'. I am sure that most of this lies within the
original OPIC filter but I still can't understand why straightforward
Lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters (which were added much later).

The decision to use a different query syntax than the one from Lucene was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations, which the Lucene API would require us to implement.

* to keep the Nutch core largely independent of Lucene, so that it's possible to use Nutch with different back-end searcher implementations. This started to materialize only now, with the ongoing effort to use Solr as a possible backend.

* to limit the query syntax to those queries that provide the best tradeoff between functionality and performance in a large-scale search engine.

On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

Ok, so I guess what I don't understand is what is the "Nutch query syntax"?

Query syntax is defined in an informal way on the Help page in nutch.war, or here:


http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from org.apache.nutch.analysis.NutchAnalysis.jj.


The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg00007.html
    I was wondering why the query syntax is so limited.
    There are no OR queries, there are no fielded queries,
    or fuzzy, or approximate... Why? The underlying index
    supports all these operations.

Actually, it's possible to configure Nutch to allow raw field queries - you need to add a raw field query plugin for this. Please see the RawFieldQueryFilter class, and existing plugins that use fielded queries: query-site, and query-more. Query-more / DateQueryFilter is especially interesting, because it shows how to use raw token values from a parsed query to build complex Lucene queries.

I notice by looking at the or.patch file (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one of the programs under consideration is:
nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java

See above - they are completely different classes, with completely different purpose. The use of the same class name is unfortunate and misleading.

Nutch Query class is intended to express queries entered by search engine users, in a tokenized and parsed way, so that the rest of Nutch may deal with Clauses, Terms and Phrases instead of plain String-s.

On the other hand, Lucene Query is intended to express arbitrarily complex Lucene queries - many of these queries would be prohibitively expensive for a large search engine (e.g. wildcard queries).
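
To make the distinction concrete, here is a minimal sketch of what "a tokenized and parsed user query" can look like. It is an illustration only: UserQuerySketch, its Clause class, parse() and render() are hypothetical names invented for this example, not the real org.apache.nutch.searcher classes, and the tiny hand-written parser is an assumption that stands in for the real grammar in NutchAnalysis.jj. It assumes only bare terms, double-quoted phrases and a leading '-' for prohibited clauses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; NOT the actual Nutch Query implementation.
final class UserQuerySketch {

    /** One clause: a single term or a quoted phrase, required unless prohibited. */
    static final class Clause {
        final List<String> words;
        final boolean prohibited;

        Clause(List<String> words, boolean prohibited) {
            this.words = words;
            this.prohibited = prohibited;
        }

        boolean isPhrase() {
            return words.size() > 1;
        }
    }

    /** Splits user input into clauses: bare terms, "quoted phrases", -prohibited. */
    static List<Clause> parse(String input) {
        List<Clause> clauses = new ArrayList<>();
        int n = input.length();
        int i = 0;
        while (i < n) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            boolean prohibited = false;
            if (c == '-') {
                prohibited = true;
                i++;
                if (i >= n) {
                    break;
                }
                c = input.charAt(i);
            }
            String text;
            if (c == '"') {
                // Phrase: everything up to the closing quote (or end of input).
                int end = input.indexOf('"', i + 1);
                if (end < 0) {
                    end = n;
                }
                text = input.substring(i + 1, end);
                i = Math.min(end + 1, n);
            } else {
                // Term: everything up to the next whitespace.
                int end = i;
                while (end < n && !Character.isWhitespace(input.charAt(end))) {
                    end++;
                }
                text = input.substring(i, end);
                i = end;
            }
            List<String> words = new ArrayList<>();
            for (String w : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            if (!words.isEmpty()) {
                clauses.add(new Clause(words, prohibited));
            }
        }
        return clauses;
    }

    /** Renders clauses in a generic +required / -prohibited notation. */
    static String render(List<Clause> clauses) {
        List<String> parts = new ArrayList<>();
        for (Clause c : clauses) {
            String body = c.isPhrase()
                    ? "\"" + String.join(" ", c.words) + "\""
                    : c.words.get(0);
            parts.add((c.prohibited ? "-" : "+") + body);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<Clause> clauses = parse("nutch \"search engine\" -spam");
        System.out.println(render(clauses)); // prints: +nutch +"search engine" -spam
    }
}
```

The point of the sketch is that code downstream of parse() sees only a flat list of clauses, with no wildcard, fuzzy or nested boolean constructs, so it is easy to translate into whatever back-end query type is in use and cheap to evaluate.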

It looks like this is an architecture issue that I don't understand. If nutch is an "extension" of lucene, why does it define a different Query class?

Nutch is NOT an extension of Lucene. It's an application that uses Lucene as a library.

Why don't we just use the Lucene code to query the indexes? Does this have something to do with the nutch webapp (nutch.war)? What is the historical genesis of this issue (or is that even relevant)?

Nutch webapp doesn't have anything to do with it. The limitations in the query syntax have different roots (see above).


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
