Re: [Nutch-general] Preventing pages to be indexed based on content

Eelco Lempsink Fri, 27 Oct 2006 00:26:23 -0700

On 25 Oct 2006, at 18:26, Andrzej Bialecki wrote:

> Eelco Lempsink wrote:
> > Of course, with high volumes of data, indexing everything first and removing it afterwards doesn't sound like a good option in my case, where only a small part of the fetched data needs to be indexed.
> > Has anyone solved this problem (elegantly)? I mainly wonder whether it's feasible to do it using only plugins, since I suspect I must implement my own Indexer.
> Plugins may also return a null doc. The standard Indexer would have to be modified to handle this gracefully, but it's trivial:

Thank you, that's indeed a good solution. The only thing that bothers me is that plugins _may_ return a null doc, but the Indexer doesn't handle that case well. (In other words, from reading the code I didn't get the idea that returning a null doc would be okay.) I submitted a bug report for this (https://issues.apache.org/jira/browse/NUTCH-393).
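
To make the idea concrete, here is a minimal sketch of the pattern being discussed: a filter may return null to reject a document, and the indexing loop skips it instead of passing the null on. This is not the Nutch API or the actual change; DocFilter, runFilters and indexAll are hypothetical names, and a "document" is reduced to a plain String for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, NOT the Nutch API: a "document" is just a String here.
final class NullDocSketch {

    /** Hypothetical filter contract: return the (possibly modified) doc, or null to reject it. */
    interface DocFilter {
        String filter(String doc);
    }

    /** Runs the filters in order; returns null as soon as one of them rejects the doc. */
    static String runFilters(List<DocFilter> filters, String doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    /** Indexing loop that skips rejected docs instead of handing null to the index. */
    static List<String> indexAll(List<DocFilter> filters, List<String> docs) {
        List<String> index = new ArrayList<>();
        for (String doc : docs) {
            String filtered = runFilters(filters, doc);
            if (filtered == null) {
                continue; // rejected by a filter: never added to the index
            }
            index.add(filtered);
        }
        return index;
    }

    public static void main(String[] args) {
        // Example content-based rule (an assumption): keep only docs mentioning "recipe".
        DocFilter keepRecipes = doc -> doc.contains("recipe") ? doc : null;
        List<String> index = indexAll(
            Arrays.asList(keepRecipes),
            Arrays.asList("pasta recipe", "site map", "soup recipe"));
        System.out.println(index);
    }
}
```

The point of the explicit null check is that rejected pages never reach the index at all, which avoids the index-then-delete round trip for the bulk of the fetched data.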


--
Regards,

Eelco Lempsink