On 2010-02-28 18:42, Ian M. Evans wrote:
I'm using Nutch as a crawler for Solr.

I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.

Is there a way to define a div by id that nutch will strip out before
tossing the content into solr?

Nutch has no such functionality out of the box. One direction worth pursuing is to write an HtmlParseFilter plugin that wraps the Boilerpipe library (http://code.google.com/p/boilerpipe/), which detects and removes boilerplate such as menus and footers heuristically rather than by div id.
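
To make the id-based stripping step concrete, here is a minimal, language-neutral sketch in Python using only the standard library. It is not a Nutch plugin and does not use Boilerpipe; the names DivStripper and strip_by_id are hypothetical, and it assumes reasonably well-nested markup (unclosed tags such as a bare <li> would throw off the depth counter). A real HtmlParseFilter would apply the same idea to the parsed DOM that Nutch hands to the plugin.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DivStripper(HTMLParser):
    """Collects page text while skipping everything inside chosen elements."""

    def __init__(self, skip_ids):
        super().__init__(convert_charrefs=True)
        self.skip_ids = set(skip_ids)
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.chunks = []      # text fragments that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped element: track nesting only.
            self.skip_depth += 1
        elif dict(attrs).get("id") in self.skip_ids:
            # Entering an element whose id is on the skip list.
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every skipped element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_by_id(html, skip_ids):
    """Return the page text with the elements named in skip_ids removed."""
    parser = DivStripper(skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling strip_by_id(page, ["menu", "footer"]) would return only the text outside those two elements, which is what would then be handed on for indexing. The skip list has to be maintained per site, which is the main reason a heuristic extractor like Boilerpipe may be the more attractive route.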

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
