[DISCUSS] Questions on Basic-Crawler Module

Lewis John Mcgibbney Fri, 13 Jan 2012 05:17:10 -0800

Hi Guys,

OK, further to my ridiculous question regarding where the module actually
is, I would like to pose some more relevant thoughts.


A while ago I opened NUTCH-1129 [1], based entirely on the suggestion which
was included within the Incubator proposal for a Nutch Any23 plugin. As you
know, crawling in the basic-crawler plugin is currently done via
crawler4j. At Apache we are great believers in eating our own dog food, so
if I were building the Nutch implementation, my proposal would be to remove
the dependencies on crawler4j and use Nutch interfaces and functionality
instead. This leads on to my questions:

1) Should the basic-crawler plugin be kept within Any23? My own thoughts
are that it provides a real nice and easy way to test out Any23
functionality, however should 'crawling' functionality be part of a project
which describes itself as "a library, a web service and a command line tool
that extracts structured data in RDF format from a variety of Web
documents."?
2) The knock-on effect of removing this module and porting it directly to
Nutch would be that to test out Any23 libraries within a crawler you would
need a working knowledge of Nutch... this could be putting up barriers to
adoption...
3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
any23-core library from the Apache repo and use it. I'm thinking of
deduplicating as much code as possible between projects... Any ideas?

Thanks

[1] https://issues.apache.org/jira/browse/NUTCH-1129

-- 
*Lewis*
