Re: [DISCUSS] Questions on Basic-Crawler Module

Michele Mostarda Sat, 14 Jan 2012 08:58:44 -0800

On 14 January 2012 17:35, Lewis John Mcgibbney <[email protected]> wrote:


> Hi Michele,
>
> I was thinking about replying to my original thread with some of the points
> you make, as I completely agree with your logic. Simone also mentioned the
> importance of keeping the basic-crawler as a plugin and I agree with this
> as well.
>

That's great!


> Once we get the Any23 packages changed to o.a.any23 rather than
> a.deri.any23, which will allow us to push it to Apache Nexus, I'll begin
> work on the Nutch-Any23 plugin. We'll take it from there.
>

Really good, I will start with the ANY23-21 just now.

>
> Thanks for getting back to me with your thoughts.
>

Please.

The best.

Mic


>
> On Sat, Jan 14, 2012 at 3:39 PM, Michele Mostarda <
> [email protected]> wrote:
>
> > On 13 January 2012 14:21, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > > Further to this, the Basic crawler plugin took some 4 mins to download
> > > dependencies, install and test...
> > >
> > > That seems a lot of overhead for a plugin which is not even mentioned in
> > > the project description, considering the overall build took some 8 mins
> > > locally.
> > >
> >
> > The Crawler plugin was added with milestone 0.7.0; the documentation
> > has not yet been written.
> >
> > Mic
> >
> >
> > >
> > > ...
> > >
> > > On Fri, Jan 13, 2012 at 1:16 PM, Lewis John Mcgibbney <
> > > [email protected]> wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > > OK, further to my ridiculous question regarding where the module
> > > > > actually is, I would like to pose some more relevant thoughts.
> > > >
> > > > > A while ago I opened NUTCH-1129 [1], based entirely on the suggestion
> > > > > which was included within the Incubator proposal for a Nutch Any23
> > > > > plugin. As you know, crawling in the basic-crawler plugin is currently
> > > > > done via crawler4j. At Apache we are great believers in eating our own
> > > > > dog food, so if I were building the Nutch implementation my proposal
> > > > > would be to remove the dependency on crawler4j and use Nutch
> > > > > interfaces and functionality instead. This leads on to my questions:
> > > >
> > > > > 1) Should the basic-crawler plugin be kept within Any23? My own
> > > > > thoughts are that it provides a really nice and easy way to test out
> > > > > Any23 functionality. However, should 'crawling' functionality be part
> > > > > of a project which describes itself as "a library, a web service and a
> > > > > command line tool that extracts structured data in RDF format from a
> > > > > variety of Web documents."?
> > > > > 2) The knock-on effect of removing this module and porting it directly
> > > > > to Nutch would be that to test out Any23 libraries within a crawler
> > > > > you would need a working knowledge of Nutch... this could be putting
> > > > > up barriers to adoption...
> > > > > 3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
> > > > > any23-core library from the Apache repo and use this. I'm thinking of
> > > > > deduplicating as much code as possible between projects... Any ideas?
> > > > Thanks
> > > >
> > > > [1] https://issues.apache.org/jira/browse/NUTCH-1129
> > > >
> > > > --
> > > > *Lewis*
> > > >
> > > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
> >
> >
> > --
> > Michele Mostarda
> > Senior Software Engineer
> > skype: michele.mostarda
> > twitter: micmos
> > mail: [email protected]
> > site : http://www.michelemostarda.com
> >
>
>
>
> --
> *Lewis*
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: [email protected]
site : http://www.michelemostarda.com
