hi,

> So let's go :
> >
> > I would like to pass to droids an xml like (just an example) :
> > <article>
> >   <droids:url>http://example.com/test.html</droids:url>
>
> In Droids crawling, the URL is the entry point of the processing. What
> happens then is highly configurable, and Ming Fai has recently suggested
> some changes for the future. I will describe the possibilities that
> Droids currently offers for the presented use case.
>
> As said, we start with the queue where you inject the starting URLs.
> Then this queue will call a worker (which basically is the part of the
> code where the real work is done). This worker may call a LinkExtractor
> and/or a Parser to extract links and any other information about the
> incoming page.



I think most crawlers (including Droids, with or without any of my suggested
changes) work in more or less the same way. There are always seed URLs that
are put in a queue/list (TaskQueue in Droids), a main component that controls
multi-threading and execution (TaskMaster), components that fetch/retrieve
the URL as an inputstream/entity (Worker and Protocol), components that
parse/process the inputstream/entity (Parser), and components that extract
outlinks (LinkExtractor) and put them back into the main queue/list (Worker).
Droids also has a URLFilter that accepts/rejects outlinks, a TaskValidator
that intercepts at add-to-queue time (it works similarly to URLFilter for
crawling, so maybe you could ignore it), and a DelayTimer to slow down the
fetching. The above refers to the current Droids implementation. I think it
covers most of the main concepts.
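
To make the flow concrete, here is a rough single-threaded sketch of the
loop described above. It is plain Python, not Droids code and not the Droids
Java API: the function names only echo the component roles (Protocol,
LinkExtractor, URLFilter), and FAKE_WEB, the `seen` set and `max_tasks` are
assumptions made up for the illustration. There is no DelayTimer and no real
fetching.

```python
from collections import deque

# Hypothetical in-memory "web" (URL -> outlinks on that page), so the
# sketch runs without any network access.
FAKE_WEB = {
    "http://example.com/test.html": [
        "http://example.com/a.html",
        "http://other.org/x.html",
    ],
    "http://example.com/a.html": [
        "http://example.com/test.html",
        "http://example.com/b.html",
    ],
    "http://example.com/b.html": [],
}


def fetch(url):
    """Plays the Protocol role: return the 'content' for a URL."""
    return FAKE_WEB.get(url, [])


def extract_links(content):
    """Plays the LinkExtractor role: here the content already is the links."""
    return list(content)


def url_filter(url):
    """Plays the URLFilter role: accept only URLs on example.com."""
    return url.startswith("http://example.com/")


def crawl(seeds, max_tasks=100):
    """Plays the TaskMaster/Worker roles in a single thread."""
    queue = deque(seeds)   # the task queue, seeded with the start URLs
    seen = set(seeds)      # assumption: remember URLs to avoid loops
    visited = []
    while queue and len(visited) < max_tasks:
        url = queue.popleft()
        content = fetch(url)
        visited.append(url)
        for link in extract_links(content):
            # Accepted, unseen outlinks go back into the main queue.
            if url_filter(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for page in crawl(["http://example.com/test.html"]):
        print(page)
```

Starting from the single seed, the loop visits the three example.com pages
once each and never queues the other.org link, because the filter rejects it.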

regards,
mingfai
