Hello all,
I'm trying to figure out the best way to crawl a site without picking up
any of the irrelevant bits such as Flash widgets, JavaScript, links to ad
networks, and so on. The objective is to index all of the relevant
textual data. (This could of course be extended to other forms of data.)
My main question is: should this sort of elimination be done during the
crawl, which would mean modifying the crawler, or should everything be
crawled and indexed, and the relevant bits then extracted by a
text-parsing system with some logic of its own?
Using the crawl-urlfilter seems like the way to go for the first option,
but I believe it has its drawbacks. Firstly, it needs regexps that match
URLs, and those would have to be handwritten (even automated scripts
would need human intervention at some point). For instance, the scripts
or images may be hosted at scripts.foo.com or at
foo.com/bar/foobar/scripts - the two layouts are different enough to
make automating the patterns tough. And any such customization would
have to be tailor-made for each site crawled - a tall task. Is there a
way to extend the crawler itself to do this? I remember seeing something
in the list archives about extending the crawler, but I can't find it
anymore. Any pointers?
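
For concreteness, here's roughly what such crawl-urlfilter.txt entries
might look like for the foo.com example above (the hostnames are made
up, and the point stands that the include/exclude patterns would still
have to be tailored per site). Rules are applied in order, first match
wins:

  # skip script/media suffixes outright
  -\.(js|JS|swf|SWF|css|gif|GIF|jpg|JPG|png|PNG)$

  # skip the scripts subdomain and /scripts paths
  -^http://scripts\.foo\.com/
  -/scripts(/|$)

  # accept everything else under foo.com
  +^http://([a-z0-9]*\.)*foo\.com/

  # skip everything else
  -.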
The second option would be to write some sort of custom class for the
indexer (a variant of the plugin example on the wiki, I guess).
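
To make that option a bit more concrete, here's a rough, self-contained
sketch of the kind of cleanup logic such a class might contain. It
deliberately sticks to plain java.util.regex rather than the Nutch
plugin interfaces (those differ between versions, so the plugin.xml
wiring and the filter method signature are left out), and the tag and
ad-host lists are only illustrative:

  import java.util.regex.Pattern;

  public class TextCleaner {

      // Drop script, style and object/embed (Flash) blocks entirely.
      private static final Pattern NON_TEXT_BLOCKS = Pattern.compile(
          "(?is)<(script|style|object|embed)[^>]*>.*?</\\1>");

      // Drop the remaining tags, keeping their inner text.
      private static final Pattern TAGS = Pattern.compile("(?s)<[^>]+>");

      // Outlink hosts we never want to index (illustrative list only).
      private static final Pattern AD_HOSTS = Pattern.compile(
          "(?i)^https?://([a-z0-9.-]*\\.)?(doubleclick\\.net|ads\\.example\\.com)/");

      /** Reduces a page to plain text. */
      public static String extractText(String html) {
          String noBlocks = NON_TEXT_BLOCKS.matcher(html).replaceAll(" ");
          String noTags = TAGS.matcher(noBlocks).replaceAll(" ");
          return noTags.replaceAll("\\s+", " ").trim();
      }

      /** True for outlinks that look like ad-network URLs. */
      public static boolean isAdLink(String url) {
          return AD_HOSTS.matcher(url).find();
      }

      public static void main(String[] args) {
          String html = "<html><head><script>var x = 1;</script></head>"
              + "<body><p>Relevant text.</p>"
              + "<object data=\"widget.swf\"></object></body></html>";
          System.out.println(extractText(html));          // Relevant text.
          System.out.println(isAdLink("http://ad.doubleclick.net/click")); // true
      }
  }

Wrapping something like this in an HtmlParseFilter or IndexingFilter
plugin (whichever fits the Nutch version) would keep the junk out of the
index while leaving the crawl itself alone.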
Either way, I'm not sure which is the better method. Any ideas would be
appreciated!
Cheers,
Viksit
PS: Cross-posted to nutch-user and nutch-agent, since I wasn't sure
which one was the better option.