On Tue, 2009-06-23 at 14:15 +0800, Mingfai wrote:
> hi,
>
> On Tue, Jun 23, 2009 at 5:23 AM, Ryan McKinley <[email protected]> wrote:
>
> >
> >
> > For background, I am not (yet) using droids for web crawling -- rather, I
> > use it to manage a bunch of jobs that keep external processes running. It
> > is easy to equate droids with crawling, but I think that is one of many
> > functions (though obviously the most generally relevant)
>
>
> Note that all of my proposed ideas refer to crawling (and I only use it
> for web crawling), so some points may not apply, e.g. my suggestion to
> remove Task and TaskMaster. They may still be valid in the core as part
> of a generic robot framework.
Hmm, Tasks can be links or real jobs to do. The concept of the task
master is to be a central unit that decides the next task to do and
gives information about the current task. Crawler or not, this central
control unit would work for both.
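Something like this, as a rough sketch (the names here are illustrative, not the actual Droids API):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the task-master concept: a central unit that
// decides the next task to do and reports the current one, whether the
// tasks are crawl links or arbitrary jobs.
interface Task {
    String getId();
}

class TaskMaster<T extends Task> {
    private final Queue<T> pending = new ArrayDeque<>();
    private T current;

    // Register a task (a link to crawl, or any other job).
    public synchronized void submit(T task) {
        pending.add(task);
    }

    // Decide and hand out the next task to work on (null when empty).
    public synchronized T next() {
        current = pending.poll();
        return current;
    }

    // Give information about the task currently being processed.
    public synchronized T getCurrentTask() {
        return current;
    }
}
```

The point is that the droid only talks to the task master; whether a task wraps a URL or an external process is invisible at this level.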
> btw, from the "what is it" on the incubator page, "intelligent standalone
> robot framework" is indeed vague to me. Other than "framework", which
> clearly means it is not a complete application, the other three terms are
> not clear to me. Does "standalone" mean it is not distributed/clustered?
For now Droids does not support clustering, but I hope future use cases
will bring it to that level. "Standalone" is meant to reflect that you
can create different droids/robots that work on their own, without
having to invoke them from within the framework.
> I actually
> expect Droids to provide a clustered infrastructure to run droids /
> execute tasks. Say, for web crawling, one can easily hit a bottleneck
> just by adding more threads, so a clustered environment is needed.
I guess it should not be hard to deploy droids to a clustered
environment.
>
> p.s. the description indeed sounds like the project is for Artificial
> Intelligence robot.
>
hehe, in the end having AI droids is the big dream. ;)
>
> > Does the *core* really need access to the whole object graph? I totally
> > agree that most specific implementations will need broader access.
> >
> > I think Droids' power will come from its flexibility / simplicity.
> > Ideally the *core* will have as few dependencies as possible.
> >
> > I agree that a sub-project/package that focuses on web crawling could
> > depend on Spring.
>
>
> how about splitting all crawler functionality into a sub-module? And
> what should stay in the core versus the crawler module? E.g. the fetcher
> and parser: a generic robot may not do fetching and parsing.
> Say, we could define that if certain functionality is needed by more
> than one module, it goes into the core.
>
> Or we could just put everything in the core first, keep the split in
> mind, and do the splitting later. It's good to start simple.
>
The idea of the "framework" is to offer the different components, like
fetcher, parser, queues, etc., to be used in the different droids. For
now I think we should keep abstract and simple components in the core,
while more dependency-heavy components should go into modules of their
own.
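To make the split concrete, something like this (interface and class names are illustrative, not the actual Droids API):

```java
// Hypothetical sketch of the core/module split: the core keeps small,
// dependency-free abstractions; anything pulling in heavy libraries
// lives in its own module.

// -- core: abstract components, no external dependencies --
interface Fetcher {
    byte[] fetch(String uri) throws Exception;
}

interface Parser {
    Iterable<String> parse(byte[] content);
}

// -- e.g. a crawler module: concrete, possibly dependency-heavy impls --
class SimpleFetcher implements Fetcher {
    @Override
    public byte[] fetch(String uri) {
        // A real module might use an HTTP client here; this stub echoes.
        return uri.getBytes();
    }
}

class SimpleParser implements Parser {
    @Override
    public Iterable<String> parse(byte[] content) {
        // A real parser might extract out-links; this stub splits on spaces.
        return java.util.Arrays.asList(new String(content).split("\\s+"));
    }
}
```

A droid then wires these together; swapping in an HttpClient-backed fetcher touches only the module, never the core.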
>
>
> >
> > 4. Link-centric design
> >> - Link, extends HashMap, will act as the main arbitrary data
> >> container, and a vehicle that stores attributes and data throughout
> >> the whole lifecycle of fetching, parsing, and extracting.
> >>
> >
> > I don't have any strong opinion here... but I would rather see an API
> > where we can rely on method calls than putting stuff into a Map --
> > perhaps years of dealing with request.getAttribute() have turned me
> > sour on this model.
>
>
>
> More elaboration on this point; it is mainly for the crawler use case:
>
> - for the crawling use case, I propose to make every component (e.g.
> fetcher, parser) use a <T extends Link> signature, e.g.
> ParserFactory<T extends Link> with newParser(T link);
> Parser<T extends Link> with parse(T link, Entity entity);
>
> The raw Link is basically a HashMap, but users who extend Link do not
> need to use any Map interface; they can implement everything as plain
> Java methods, e.g.
> public class EnhancedLink extends Link {
>     protected Set<Link> outLinks;
>     // ... getters and setters
> }
>
> And they can implement their own Extractor<? extends Link> that uses
> setOutLinks() to store out-links on their own link type.
>
> - Referring to DROIDS-52 (https://issues.apache.org/jira/browse/DROIDS-52),
> we would want to store minimal data in a Link. Unless one implements a
> Queue that supports passivation, continually putting Link/LinkTask
> objects onto the Queue/TaskQueue will consume a lot of memory.
>
> - The difficulty with defining a plain interface is that it is hard to
> define one that will be used across different components, and it is not
> possible to foresee what data users will want to attach to the Link.
> Assume the following crawler flow: polling (a link from the queue) ->
> fetching -> parsing -> extracting
>
> Take an example from a real use case I have encountered. In fetching,
> the fetcher has a Request and a Response. The response contains HTTP
> headers, including Cookie headers. In out-link extraction, we may want
> to create a link carrying the cookie data. It is no good to pass the
> response object all the way to the extractor (and it might not be
> possible when the response is not serializable) if Link is not a
> Map-like container. Link may have to carry a List<HTTPHeader> to handle
> my requirement.
Yeah I see your point and agree.
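As I read the proposal, it boils down to something like this rough sketch: Link extends HashMap, subclasses expose typed accessors instead of Map calls, and components are typed with <T extends Link> so the cookie headers captured at fetch time travel to the extractor without dragging the Response along. (All names are illustrative, not actual Droids code.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// The raw Link is basically a HashMap used as an arbitrary data container.
class Link extends HashMap<String, Object> {
    public Link(String uri) { put("uri", uri); }
    public String getUri() { return (String) get("uri"); }
}

// Users extend Link and expose plain typed accessors instead of Map calls.
class EnhancedLink extends Link {
    public EnhancedLink(String uri) { super(uri); }

    @SuppressWarnings("unchecked")
    public List<String> getHeaders() {
        return (List<String>) computeIfAbsent("headers",
                k -> new ArrayList<String>());
    }

    public void addHeader(String header) { getHeaders().add(header); }
}

// Components are typed against <T extends Link>, so each stage sees the
// same enriched link without the raw Response being passed along.
interface Extractor<T extends Link> {
    List<T> extract(T link);
}

class CookieAwareExtractor implements Extractor<EnhancedLink> {
    @Override
    public List<EnhancedLink> extract(EnhancedLink link) {
        // Propagate cookie headers captured at fetch time to out-links.
        EnhancedLink out = new EnhancedLink(link.getUri() + "/child");
        for (String h : link.getHeaders()) {
            if (h.startsWith("Set-Cookie:")) out.addHeader(h);
        }
        List<EnhancedLink> outLinks = new ArrayList<>();
        outLinks.add(out);
        return outLinks;
    }
}
```

The Map backing keeps the core oblivious to user data, while subclasses give the method-call API Ryan prefers.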
salu2
--
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
Sociedad Andaluza para el Desarrollo de la Sociedad
de la Información, S.A.U. (SADESI)