--
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
--- Begin Message ---
On Tue, 2007-02-20 at 16:10 -0500, Renaud Richardet wrote:
> Hi Thorsten,
>
> I have quickly looked at the Droid code, and was wondering why you don't
> want to completely reuse the Nutch plugin API in Droid. This way, you
> could reuse the Nutch parse-* plugins without modifications. Just trying
> to understand...
Right now the build system around Droids is a slimmed-down Nutch. I
removed all dependencies on Hadoop, Lucene and Nutch to keep the core
independent. That is why you cannot directly reuse the Nutch plugins:
e.g. the parse API has changed, and other APIs are a wee bit different
to make them more general.
I have not yet implemented the parsing of the plugin manifests other than
the extension-related elements, since I first want to have a PoC. I do not
like patching core build files (adding a plugin means adding a line to
the plugin target); furthermore, I prefer to use library dependency
management, i.e. Ivy.
Having said all this, it should be very easy to write a crawler that acts
like Nutch and uses the same plugins as Nutch but is based on Droids.
The first implementation is just a wget-style crawler, and only a PoC
implementation. If there is interest here in a Nutch crawler plugin
for Droids, I reckon one can write a basic implementation quite quickly.
The only question is whether there is any interest.
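To make the wget-style flow concrete, here is a minimal sketch in Java of
the loop the announcement describes: crawl a URL, extract <a/> links, merge
them with the queue, and print the crawled page. This is not the Droids
code. The class name CrawlSketch, the methods crawl and extractLinks, the
fetcher parameter and the maxPages limit are all assumptions made for
illustration, and the regex merely stands in for a real parse-html plugin.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from Droids or Nutch.
final class CrawlSketch {

    // Captures the double-quoted href value of <a> tags only.
    // A real parser plugin would use an HTML parser instead of a regex.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href targets of all <a> elements found in the HTML. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Crawls breadth-first from the seed and returns the URLs in the order
     * they were fetched. The fetcher maps a URL to its content and returns
     * null when the page is unavailable (an assumption of this sketch).
     */
    static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> crawled = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty() && crawled.size() < maxPages) {
            String url = queue.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // nothing to parse for this URL
            }
            crawled.add(url);
            System.out.println("crawled " + url + " (" + html.length() + " chars)");
            // Merge newly found links with the queue, skipping known URLs.
            for (String link : extractLinks(html)) {
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return crawled;
    }

    public static void main(String[] args) {
        // In-memory pages stand in for real HTTP fetching.
        Map<String, String> pages = Map.of(
                "http://example.org/",
                "<a href=\"http://example.org/a\">a</a> <a href=\"http://example.org/b\">b</a>",
                "http://example.org/a", "<a href=\"http://example.org/\">home</a>",
                "http://example.org/b", "no links here");
        crawl("http://example.org/", pages::get, 10);
    }
}
```

Passing the fetcher in as a function keeps the loop independent of how
pages are retrieved, which is roughly the separation a plugin-based core
aims for; the sketch does not resolve relative links or honour robots.txt.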
salu2
>
> Thanks,
> Renaud
>
>
> Thorsten Scherler wrote:
> > Hi all,
> >
> > I have finished a first version of an Apache Labs project called Apache
> > Droids and checked it in. All Apache committers have write access there,
> > so feel free to enhance the code. Everybody is welcome to join the
> > effort.
> >
> > Please see http://svn.apache.org/repos/asf/labs/droids/README.TXT to get
> > started.
> >
> > What is this?
> > -------------
> > Droids aims to be an intelligent standalone crawl framework that
> > automatically seeks out relevant online information based on the user's
> > specifications. The core is a simple crawler which can be
> > easily extended by plugins. So if a project/app needs special
> > processing for a crawled url one can easily write a plugin to implement
> > the functionality.
> >
> > Why was it created?
> > -------------------
> > Mainly because of personal curiosity:
> > The background of this work is that Cocoon trunk no longer provides a
> > crawler and Forrest is based on it, meaning we cannot update
> > until we find a crawler replacement. Getting more involved in
> > Solr and Nutch, I see requests for a generic standalone crawler.
> >
> > What does the first implementation crawler-x-m02y07 look like?
> > --------------------------------------------------------------
> > I took Nutch, ripped out its awesome plugin/extension framework, and
> > modified it to create the Droids core.
> > Now I can implement all functionality in plugins. Droids should be
> > very easy to extend.
> > I wrote some proof-of-concept plugins that make up crawler-x-m02y07 to
> > - crawl a URL (CrawlerImpl)
> > - extract links (only <a/> ATM) via a parse-html plugin
> > - merge them with the queue
> > - save or print out the crawled pages.
> >
> > Why crawler-x-m02y07?
> > ---------------------
> > Droids tries to be a framework for different droids.
> > The first implementation is a "crawler" with the name "x",
> > first archived in the second "m"onth of the "y"ear 20"07".
> >
> > Next steps
> > ----------
> > I still need to write a droids factory, so that one can write
> > an implementation other than Xm02y07 as a crawler plugin and invoke it
> > via the CLI. Another to-do is to implement a dependency system like
> > Apache Ivy instead of copying the Nutch approach.
> >
> > Open questions for nutch
> > ------------------------
> > Is there interest in a Nutch crawler plugin that uses native Nutch
> > plugins and imitates the Nutch crawler in Droids? Is Nutch interested in
> > such a plugin? Does it make sense?
> >
> > Please test and report feedback to [EMAIL PROTECTED]. I will happily
> > answer all mails there.
> >
> > salu2
> >
>
>
--- End Message ---
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers