Re: Update on ignoring menu divs

Andrzej Bialecki Sun, 28 Feb 2010 12:44:59 -0800

On 2010-02-28 18:42, Ian M. Evans wrote:

Using Nutch as a crawler for solr.


I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.

Is there a way to define a div by id that nutch will strip out before
tossing the content into solr?

There is no such functionality out of the box. One direction that is
worth pursuing would be to create an HtmlParseFilter plugin that wraps
the Boilerpipe library http://code.google.com/p/boilerpipe/ .
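
As a rough illustration of the simpler id-based stripping asked about
above (not of Boilerpipe, which detects boilerplate heuristically), the
sketch below removes div elements with given ids from a W3C DOM tree
using only the JDK. The class name DivStripper and both method names are
hypothetical. An HtmlParseFilter is handed the parsed page as a DOM
fragment, so a plugin could presumably call something like this on it;
the plugin wiring, and how the cleaned DOM is turned back into the text
that gets indexed, are left out and would need checking against the
Nutch version in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Nutch or Boilerpipe.
final class DivStripper {

    // Removes every <div> under root whose id attribute is in ids.
    static void stripDivsById(Node root, Set<String> ids) {
        // Collect first, then remove, so the tree is not modified
        // while it is being walked.
        List<Node> doomed = new ArrayList<>();
        collect(root, ids, doomed);
        for (Node n : doomed) {
            Node parent = n.getParentNode();
            if (parent != null) {
                parent.removeChild(n);
            }
        }
    }

    private static void collect(Node node, Set<String> ids, List<Node> doomed) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report tag names in upper case.
            if ("div".equalsIgnoreCase(el.getTagName())
                    && ids.contains(el.getAttribute("id"))) {
                doomed.add(el);
                return; // the whole subtree goes, no need to descend
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), ids, doomed);
        }
    }

    // Convenience for trying the helper on well-formed XHTML without Nutch.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling DivStripper.stripDivsById(doc, ids) with the ids of the menu and
footer divs leaves only the remaining content in the tree. The ids have
to be known per site, which is the main drawback compared with a
heuristic extractor such as Boilerpipe.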


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
