Re: Update on ignoring menu divs

Ken Krugler Mon, 01 Mar 2010 05:45:51 -0800


On Feb 28, 2010, at 10:24pm, Sami Siren wrote:

Andrzej Bialecki wrote:
On 2010-02-28 18:42, Ian M. Evans wrote:
Using Nutch as a crawler for Solr.

I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.
There is no such functionality out of the box. One direction that is worth pursuing would be to create an HtmlParseFilter plugin that wraps the Boilerpipe library http://code.google.com/p/boilerpipe/ .
Andrzej, have you tested that lib? If the result is of decent quality it would be nice to have that wrapped as a plugin in Nutch.

We've done some testing of it with general web crawls, and it seems promising.

I've got a patch that modifies Boilerpipe so it works better as a plug-in to Tika, so if/when that gets rolled into the source then it should be easier to use the project with Nutch.


-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
