On Tue, 2009-06-23 at 14:15 +0800, Mingfai wrote:
> hi,
> 
> On Tue, Jun 23, 2009 at 5:23 AM, Ryan McKinley <[email protected]> wrote:
> 
> >
> >
> > For background, I am not (yet) using droids for web crawling -- rather, I
> > use it to manage a bunch of jobs that keep external processes running.  It
> > is easy to equate droids with crawling, but I think that is one of many
> > functions (though obviously the most generally relevant)
> 
> 
> note that all of my proposed ideas refer to crawling (and I only use it
> for web crawling), so some points may not apply in general, e.g. my
> suggestion to remove Task and TaskMaster. They may still be valid for
> the core as a generic robot framework.

Hmm, Tasks can be links or real jobs to do. The concept of the
TaskMaster is to be a central unit that decides the next task to do and
gives information about the current task. Crawler or not, this central
control unit works for both.
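
To make the idea concrete, here is a minimal sketch of such a central
control unit (illustrative names only, not the actual Droids API):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: a Task is anything with an identifier,
// whether a link to fetch or an external job to run.
interface Task {
    String getId();
}

// The TaskMaster is the central unit that hands out the next task
// and reports on the one currently being processed.
class SimpleTaskMaster<T extends Task> {
    private final Queue<T> pending = new ArrayDeque<>();
    private T current;

    public void submit(T task) {
        pending.add(task);
    }

    // Decide the next task to do (FIFO here; a real implementation
    // could prioritize, de-duplicate, politeness-delay, etc.).
    public T next() {
        current = pending.poll();
        return current;
    }

    // Information about the task currently being worked on.
    public T getCurrent() {
        return current;
    }
}
```

Nothing in that sketch knows about crawling, which is the point: the
same control unit serves a crawler droid or a job-runner droid.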


> btw, from the "what is it" section on the incubator page, "intelligent
> standalone robot framework" is indeed vague to me. Other than
> "framework", which clearly means it is not a complete application, the
> other three terms are not clear to me. Does "standalone" mean it is not
> distributed/clustered?

For now droids clustering is not supported, but I hope future use cases
will bring it to that level. "Standalone" is meant to reflect that you
can generate different droids/robots that work on their own, without
having to invoke them within the framework.

> I actually expected Droids to provide a clustered infrastructure to
> run Droids / execute Tasks. Say, for web crawling, one can easily hit
> a bottleneck just by adding more threads, so a clustered environment
> is needed.

I guess it should not be hard to deploy droids to a clustered
environment.

> 
> p.s. the description indeed sounds like the project is for an
> Artificial Intelligence robot.
> 

hehe, in the end having AI droids is the big dream. ;)

> 
> > Does the *core* really need access to the whole object graph?  I totally
> > agree that most specific implementations will need broader access.
> >
> > I think droids' power will come from its flexibility / simplicity.  Ideally
> > the *core* will have as few dependencies as possible.
> >
> > I agree that sub-project/package that focuses on web crawling could depend
> > on spring.
> 
> 
> how about splitting any crawler functionality into a sub-module? And
> what should stay in the core vs. the crawler module? E.g. the fetcher
> and parser: a generic robot may not do fetching and parsing. Say, we
> could define that any functionality needed by more than one module
> goes into the core.
> 
> Or we could just put everything in the core first, keep in mind that
> we'll split it, and do the splitting later. It's good to start simple.
> 

The idea of the "framework" is to offer the different components like
fetcher, parser, queues, etc. to be used in the different droids. For
now I think we should keep abstract and simple components in the core,
while more dependency-heavy components should go into a module of their
own.
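
Something along these lines, purely as a sketch with made-up names:
the core holds small dependency-free interfaces, and a concrete droid
just wires implementations together (an HTTP fetcher or HTML parser
would live in its own module):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical core abstractions with no external dependencies.
interface Fetcher {
    String fetch(String uri);
}

interface Parser {
    // Parse fetched content and return discovered out-links.
    Iterable<String> parse(String content);
}

// A droid assembled from whatever components it needs.
class SimpleDroid {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Fetcher fetcher;
    private final Parser parser;

    SimpleDroid(Fetcher fetcher, Parser parser) {
        this.fetcher = fetcher;
        this.parser = parser;
    }

    void start(String seed) {
        queue.add(seed);
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            String content = fetcher.fetch(uri);
            for (String out : parser.parse(content)) {
                queue.add(out);
            }
        }
    }
}
```

With that split, a generic robot that neither fetches nor parses simply
does not pull in the crawler module.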

> 
> 
> >
> >         4. Link-centric design
> >>  - Link, which extends HashMap, will act as the main arbitrary data
> >>     container, and a vehicle that stores attributes and data
> >>     throughout the whole lifecycle of fetching, parsing, and
> >>     extracting.
> >>
> >
> > I don't have any strong opinion here... but I would rather see an API
> > where we can rely on method calls than putting stuff into a Map -- perhaps
> > years of dealing with request.getAttribute() has soured me on this
> > model.
> 
> 
> 
> more elaboration on this point; it is mainly for the crawler use case:
> 
>    - for the crawling use case, I propose making every component (e.g.
>    fetcher, parser) use a <? extends Link> signature, e.g.
>    ParserFactory<T extends Link> with newParser(T link);
>    Parser<T extends Link> with parse(T link, Entity entity);
> 
>    The raw Link is basically a HashMap, but users that extend Link do
>    not need to use any Map interface; they can implement everything as
>    plain Java methods, e.g.
>    public class EnhancedLink extends Link {
>      protected Set<Link> outLinks;
>      //.. getter and setter
>    }
> 
>    And they can implement their own Extractor<? extends Link> that uses
>    setOutLinks() to store out-links on their own Link subclass.
> 
>    - referring to DROIDS-52 (https://issues.apache.org/jira/browse/DROIDS-52),
>    we would want to store minimal data in a Link. Unless one implements
>    a Queue that supports passivation, continually putting Link/LinkTask
>    objects into the Queue/TaskQueue will consume a lot of memory.
> 
>    - The difficulty with a normal interface is that it is hard to
>    define one that will be used across the different components, and
>    it is not possible to foresee what data users will want to attach
>    to the Link. Assume the following crawler flow: polling (a link
>    from the queue) -> fetching -> parsing -> extracting
> 
>    Take an example from a real use case I have encountered. In
>    fetching, the fetcher has a Request and a Response. The response
>    contains HTTP headers, including Cookie headers. During out-link
>    extraction, we may want to create a link with the cookie data. If
>    Link is not a Map-like container, it is no good to pass the
>    response object all the way to the extractor (and it might not even
>    be possible when the response is not serializable); Link may then
>    have to carry a List<HTTPHeader> just to handle my requirement.

Yeah I see your point and agree.
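
A rough sketch of how I read your proposal (names are illustrative, not
a committed API): the raw Link is a Map, so arbitrary attributes like
HTTP headers survive the whole fetch/parse/extract lifecycle, while
subclasses expose plain typed accessors and components are typed
against the subclass they need:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

// The raw Link is just a Map-backed attribute container.
class Link extends HashMap<String, Object> {
    public Link(String uri) {
        put("uri", uri);
    }
    public String getUri() {
        return (String) get("uri");
    }
}

// A user's subclass adds typed fields without touching the Map API.
class EnhancedLink extends Link {
    private final Set<Link> outLinks = new HashSet<>();

    public EnhancedLink(String uri) {
        super(uri);
    }
    public void addOutLink(Link link) {
        outLinks.add(link);
    }
    public Set<Link> getOutLinks() {
        return outLinks;
    }
}

// Components are typed against the concrete Link subclass they need.
interface Extractor<T extends Link> {
    void extract(T link, String content);
}
```

The Map base class covers the "unforeseeable attributes" case (cookies,
headers), and the typed subclass covers Ryan's preference for method
calls over getAttribute-style lookups.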

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)



