Hi Guys,

OK, further to my ridiculous question regarding where the module actually is, I would like to pose some more relevant thoughts.

A while ago I opened NUTCH-1129 [1], based entirely on the suggestion included in the Incubator proposal for a Nutch Any23 plugin. As you know, crawling in the basic-crawler plugin is currently done via crawler4j. At Apache we are great believers in eating our own dog food, so if I were building the Nutch implementation, my proposal would be to remove the dependencies on crawler4j and use Nutch interfaces and functionality instead.

This leads on to my questions:

1) Should the basic-crawler plugin be kept within Any23? My own thoughts are that it provides a really nice and easy way to test out Any23 functionality. However, should 'crawling' functionality be part of a project which describes itself as "a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents."?

2) The knock-on effect of removing this module and porting it directly to Nutch would be that, to test out Any23 libraries within a crawler, you would need a working knowledge of Nutch. This could put up barriers to adoption.

3) I'm assuming that a Nutch plugin would simply use Ivy to pull the any23-core library from the Apache repo and use it. I'm thinking of deduplicating as much code as possible between projects.

Any ideas?

Thanks

[1] https://issues.apache.org/jira/browse/NUTCH-1129

--
*Lewis*
