Further to this, the basic-crawler plugin took some 4 mins to download dependencies, install and test.
That seems like a lot of overhead for a plugin which is not even mentioned in the project description, considering the overall build took some 8 mins locally.

On Fri, Jan 13, 2012 at 1:16 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Guys,
>
> OK further to my ridiculous question regarding where the module actually
> is, I would like to pose some more relevant thoughts.
>
> A while ago I opened NUTCH-1129 [1], based entirely on the suggestion
> which was included within the Incubator proposal for a Nutch Any23 plugin.
> As you know, currently the crawling in the basic-crawler plugin is done via
> crawler4j. At Apache we are great believers in eating our own dog food,
> therefore my proposal, if I was building the Nutch implementation, would be
> to remove the dependencies on crawler4j and use Nutch interfaces and
> functionality instead. This kind of leads on to my questions:
>
> 1) Should the basic-crawler plugin be kept within Any23? My own thoughts
> are that it provides a real nice and easy way to test out Any23
> functionality, however should 'crawling' functionality be part of a project
> which describes itself as "a library, a web service and a command line tool
> that extracts structured data in RDF format from a variety of Web
> documents."?
> 2) The knock-on effect of removing this module and porting it directly to
> Nutch would be that to test out Any23 libraries within a crawler you would
> need a working knowledge of Nutch... this could be putting up barriers to
> adoption...
> 3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
> any23-core library from the Apache repo and use this. I'm thinking of
> deduplicating as much code as possible between projects... Any ideas?
>
> Thanks
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1129
>
> --
> *Lewis*
>

--
*Lewis*
