hi,
> So let's go :
>
> > I would like to pass to droids an xml like (just an example) :
> >
> >   <article>
> >     <droids:url>http://example.com/test.html</droids:url>
>
> In droids crawling the url is the entrance point of the processing. What
> happens then is highly configurable and currently Ming Fai has suggested
> some changes for the future. I will describe the possibilities that
> droids currently offers for the presented use case.
>
> Like said we start with the queue where you inject the starting urls.
> Then this queue will call a worker (which basically is the part of the
> code where the real work is done). This worker may call a linkExtractor
> and/or a Parser to extract links and any other information about the
> incoming page.

I think most crawlers (incl. Droids and any of my suggested changes) work in more or less the same way. We always have URLs as seeds that are put in a queue/list (TaskQueue in Droids), a main component to control multi-threading and execution (TaskMaster), components to fetch/retrieve the URL as an inputstream/entity (Worker and Protocol), components to parse/process the inputstream/entity (Parser), and components to extract outlinks (LinkExtractor) and put them back into the main queue/list (Worker).

Droids also has a URLFilter that accepts/rejects outlinks, a TaskValidator to intercept at add-to-queue time (which works similarly to URLFilter for crawling; maybe you could ignore this), and a DelayTimer to slow down the fetching.

The above refers to the current Droids implementation. I think it covers most of the main concepts.

regards,
mingfai
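To make the pipeline above concrete, here is a minimal single-threaded sketch of that seed -> queue -> worker -> link-extraction -> back-to-queue loop. This is a hypothetical illustration, not the real Droids API: the class and method names are mine, and fetching/parsing is stubbed out with a fixed in-memory map standing in for Protocol, Parser and LinkExtractor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Simplified sketch of the crawl loop described above. The comments name the
// Droids concepts (TaskQueue, TaskMaster, Worker, Parser, LinkExtractor,
// URLFilter) each part stands in for; none of this is the actual Droids code.
public class CrawlSketch {

    // Stub for Protocol + Parser + LinkExtractor: url -> extracted outlinks.
    static final Map<String, List<String>> WEB = Map.of(
            "http://example.com/test.html",
            List.of("http://example.com/a.html", "http://other.org/x.html"),
            "http://example.com/a.html",
            List.of());

    // URLFilter role: accept or reject an outlink before it re-enters the queue.
    static boolean accept(String url) {
        return url.startsWith("http://example.com/");
    }

    public static List<String> crawl(String seed) {
        Queue<String> taskQueue = new ArrayDeque<>(); // TaskQueue: seeds + found links
        Set<String> seen = new HashSet<>();           // dedup, roughly TaskValidator's role
        List<String> visited = new ArrayList<>();

        taskQueue.add(seed);
        seen.add(seed);
        while (!taskQueue.isEmpty()) {
            // TaskMaster hands the next task to a Worker (single-threaded here).
            String url = taskQueue.poll();
            visited.add(url);
            // The Worker "fetches" and "parses" the page, extracts outlinks...
            for (String out : WEB.getOrDefault(url, List.of())) {
                // ...and puts accepted, not-yet-seen links back into the main queue.
                if (accept(out) && seen.add(out)) {
                    taskQueue.add(out);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/test.html"));
        // prints [http://example.com/test.html, http://example.com/a.html]
    }
}
```

Note that http://other.org/x.html is dropped by the URLFilter stand-in, and the seen set keeps already-queued links from being re-added, which is the kind of interception TaskValidator does at add-to-queue time.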
