On 14 January 2012 17:35, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Michele,
>
> I was thinking about replying to my original thread with some of the
> points you make as I completely agree with your logic. Simone also
> mentioned the importance of keeping the basic-crawler as a plugin and I
> agree with this as well.

That's great!

> Once we get the Any23 packages changed to o.a.any23 rather than
> a.deri.any23, this will allow us to push it to apache nexus, I'll begin
> work on the Nutch-Any23 plugin. We'll take it from there.

Really good, I will start with the ANY23-21 just now.

> Thanks for getting back to me with your thoughts.

Please.
The best.
Mic

> On Sat, Jan 14, 2012 at 3:39 PM, Michele Mostarda <[email protected]> wrote:
>
> > On 13 January 2012 14:21, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > > Further to this, the Basic crawler plugin took some 4 mins to
> > > download dependencies, install and test...
> > >
> > > Seems a lot of overhead for a plugin which is not even mentioned in
> > > the project description. Considering the overall build took some
> > > 8 mins locally.
> >
> > The Crawler plugin has been added with milestone 0.7.0, the
> > documentation has not yet been written.
> >
> > Mic
> >
> > > ...
> > >
> > > On Fri, Jan 13, 2012 at 1:16 PM, Lewis John Mcgibbney <[email protected]> wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > OK further to my ridiculous question regarding where the module
> > > > actually is, I would like to pose some more relevant thoughts.
> > > >
> > > > A while ago I opened NUTCH-1129 [1], based entirely on the
> > > > suggestion which was included within the Incubator proposal for a
> > > > Nutch Any23 plugin.
> > > > As you know, currently the crawling in the basic-crawler plugin is
> > > > done via crawler4j, @ Apache we are great believers of eat your
> > > > own dog food, therefore my proposal would be to remove the
> > > > dependencies on crawler4j if I was building the Nutch
> > > > implementation using instead Nutch interfaces and functionality.
> > > > This kind of leads on to my question as to
> > > >
> > > > 1) Should the basic-crawler plugin be kept within Any23? My own
> > > > thoughts are that it provides a real nice and easy way to test out
> > > > Any23 functionality, however should 'crawling' functionality be
> > > > part of a project which describes itself as "a library, a web
> > > > service and a command line tool that extracts structured data in
> > > > RDF format from a variety of Web documents."?
> > > > 2) The knock-on effect of removing this module and porting it
> > > > directly to Nutch would be that to test out Any23 libraries within
> > > > a crawler you would need a working knowledge of Nutch... this
> > > > could be putting up barriers to adoption...
> > > > 3) I'm assuming that a Nutch plugin would simply use Ivy to pull
> > > > the any23-core library from the Apache repo and use this, I'm
> > > > thinking of deduplicating as much code as possible between
> > > > projects... Any ideas?
> > > >
> > > > Thanks
> > > >
> > > > [1] https://issues.apache.org/jira/browse/NUTCH-1129
> > > >
> > > > --
> > > > *Lewis*
> > >
> > > --
> > > *Lewis*
> >
> > --
> > Michele Mostarda
> > Senior Software Engineer
> > skype: michele.mostarda
> > twitter: micmos
> > mail: [email protected]
> > site : http://www.michelemostarda.com
>
> --
> *Lewis*

--
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: [email protected]
site : http://www.michelemostarda.com
