On Feb 28, 2010, at 10:24pm, Sami Siren wrote:

Andrzej Bialecki wrote:
On 2010-02-28 18:42, Ian M. Evans wrote:
Using Nutch as a crawler for solr.

I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.

There is no such functionality out of the box. One direction that is worth pursuing would be to create an HtmlParseFilter plugin that wraps the Boilerpipe library (http://code.google.com/p/boilerpipe/).

Andrzej, have you tested that lib? If the result is of decent quality it would be nice to have that wrapped as a plugin in Nutch.

We've done some testing of it with general web crawls, and it seems promising.

I've got a patch that modifies Boilerpipe so it works better as a plug-in to Tika, so if/when that gets rolled into the source then it should be easier to use the project with Nutch.
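
To illustrate the kind of filtering being discussed (dropping menus and footers before indexing), here is a minimal, self-contained sketch of one simple boilerplate heuristic: discard blocks whose text is mostly link text or is very short. This is not Boilerpipe's algorithm and it does not use the Nutch HtmlParseFilter API. The class name LinkDensityFilter, the method mainContent, and the thresholds are hypothetical, chosen only for this example, and regex-based HTML handling is a simplification that a real plugin would replace with a proper parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Nutch, Tika, or Boilerpipe.
class LinkDensityFilter {
    // Tags treated as block boundaries (a deliberate simplification).
    private static final Pattern BLOCK_BREAK =
        Pattern.compile("(?i)</?(?:div|p|ul|ol|li|td|tr|table|h[1-6]|br)\\b[^>]*>");
    // Captures the content of each anchor element.
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Removes all tags and collapses runs of whitespace.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Keeps blocks whose share of link text is at most maxLinkDensity
    // and which contain at least minWords words.
    static List<String> mainContent(String html, double maxLinkDensity, int minWords) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            String text = stripTags(block);
            if (text.isEmpty()) {
                continue;
            }
            int linkChars = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkChars += stripTags(m.group(1)).length();
            }
            double density = (double) linkChars / text.length();
            int words = text.split(" ").length;
            if (density <= maxLinkDensity && words >= minWords) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
            "<div id=\"menu\"><a href=\"/\">Home</a> <a href=\"/about\">About us</a></div>"
            + "<p>Nutch fetched this page and the body text is what we want to index in Solr.</p>"
            + "<div id=\"footer\">Copyright <a href=\"/terms\">Terms of use</a></div>";
        // Thresholds are assumptions for this sample, not tuned values.
        for (String block : mainContent(html, 0.3, 5)) {
            System.out.println(block);
        }
    }
}
```

With the sample page above, the menu and footer blocks are dropped because most of their text sits inside links, and only the paragraph is kept. A real plugin would apply the same idea inside the parse filter and pass the cleaned text on to the Solr indexer.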

-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
