Jeremy Calvert wrote:
I'm working on a project wherein we would like to
perform Lucene searches over several Nutch-generated
indexes residing on separate servers. DistributedSearch (and subclasses) provide us means to
do this for boolean queries. But we want more
flexibility and so we're expanding DistributedSearch
to allow other lucene queries such as span, fuzzy,
range, etc.

It should not be hard to implement these as Nutch QueryFilter plugins. Thus, one could add "fuzzy:foo" or "range:foo-bar" to a Nutch query and the plugin would translate these into appropriate Lucene clauses and add them to the generated Lucene query. Does this make sense?
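
A minimal sketch of that translation step follows. It assumes a whitespace-separated query and does not use Nutch's QueryFilter interface or any Lucene class: `PrefixedTermSketch`, `Clause`, `parse` and `translate` are hypothetical names chosen for illustration only. In a real plugin, each clause would become the corresponding Lucene query object and be added to the generated Lucene query.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixedTermSketch {

    /** Hypothetical stand-in for a Lucene clause; not a Nutch or Lucene type. */
    public static final class Clause {
        public final String kind;   // "FUZZY", "RANGE" or "TERM"
        public final String lower;  // the term itself, or the lower bound of a range
        public final String upper;  // the upper bound of a range, otherwise null

        Clause(String kind, String lower, String upper) {
            this.kind = kind;
            this.lower = lower;
            this.upper = upper;
        }
    }

    /** Maps one query token to a clause based on its prefix. */
    public static Clause parse(String token) {
        if (token.startsWith("fuzzy:")) {
            return new Clause("FUZZY", token.substring("fuzzy:".length()), null);
        }
        if (token.startsWith("range:")) {
            String body = token.substring("range:".length());
            int dash = body.indexOf('-');
            // Require a non-empty bound on both sides of the dash.
            if (dash > 0 && dash < body.length() - 1) {
                return new Clause("RANGE", body.substring(0, dash), body.substring(dash + 1));
            }
        }
        // Anything else, including a malformed range, is kept as a plain term.
        return new Clause("TERM", token, null);
    }

    /** Splits a query on whitespace and translates every token. */
    public static List<Clause> translate(String query) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                clauses.add(parse(token));
            }
        }
        return clauses;
    }
}
```

For example, `PrefixedTermSketch.translate("nutch fuzzy:foo range:foo-bar")` yields one plain term, one fuzzy clause on "foo", and one range clause from "foo" to "bar".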


Doing so raises a question:  Why does
net.nutch.searcher.Query implement the Writable
interface (which uses java.io.DataInput/DataOutput serialization)
while org.apache.lucene.search.Query implements
java.io.Serializable?  It seems that it would have
been simpler to write the DistributedSearch package
were they both to implement the same serialization
(hence simpler for us to expand upon).  I'm guessing
there are specific reasons that they ended up being
written this way, and was hoping you could let me know
what they are.

The translation from Nutch query to Lucene query happens locally on each search node, so that it can use index-specific information. Only the Nutch query is sent between nodes, so we never need to serialize the Lucene query.


Nutch uses its own serialization and IPC implementations instead of Java's serialization and RMI for better scalability, reliability and performance. Nutch's serialization is more compact and faster to read and write. Nutch also tightly controls the use and re-use of network sockets and threads, and gracefully handles node and request failures. Nutch's goal is to scale to hundreds or thousands of nodes, where these issues become critical: network bandwidth becomes a precious commodity and machines fail regularly. Nutch's IPC system may not yet perfectly handle these situations, but I feel it's a better foundation than RMI and Java Serialization.
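
To illustrate the compactness point, here is a small sketch that writes the same object once through java.io.DataOutput, in the style of a Writable, and once through Java serialization. The class below is a hypothetical stand-in, not Nutch's Query, and the names `SerializationSizeSketch`, `TermQuery`, `writeCompact`, `readCompact` and `writeJava` are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSizeSketch {

    /** Hypothetical query holding a few terms; not Nutch's Query class. */
    public static final class TermQuery implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String[] terms;

        public TermQuery(String... terms) {
            this.terms = terms;
        }

        /** Writable-style output: a count followed by each term, no class metadata. */
        public void write(DataOutput out) throws IOException {
            out.writeInt(terms.length);
            for (String term : terms) {
                out.writeUTF(term);
            }
        }

        /** Reads back exactly what write() produced. */
        public static TermQuery read(DataInput in) throws IOException {
            String[] terms = new String[in.readInt()];
            for (int i = 0; i < terms.length; i++) {
                terms[i] = in.readUTF();
            }
            return new TermQuery(terms);
        }
    }

    /** Serializes with the hand-written DataOutput format. */
    public static byte[] writeCompact(TermQuery query) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            query.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Deserializes the hand-written format. */
    public static TermQuery readCompact(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return TermQuery.read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Serializes with standard Java serialization, which also writes class descriptors. */
    public static byte[] writeJava(TermQuery query) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(query);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For a query with the terms "nutch" and "search", the hand-written form takes 19 bytes: a 4-byte count plus a 2-byte length prefix and the ASCII bytes of each term. The Java-serialized form of the same object is larger because the stream also carries a header and class descriptors.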

Doug

