Re: some proposed ideas for Droids

Mingfai Tue, 14 Jul 2009 01:50:37 -0700

hi,


> >
> > Does the *core* really need access to the whole object graph?  I
> > totally agree that most specific implementations will need broader
> > access.
> >
> > I think Droids' power will come from its flexibility / simplicity.
> > Ideally the *core* will have as few dependencies as possible.
> >
> > I agree that sub-project/package that focuses on web crawling could
> > depend on spring.
>
> I remember that we once had a dependency on spring from the core.
> However Oleg had shown with the simpleRuntime that we do not need to
> rely on spring in the core. I agree with Ryan and would like to
> "outsource" this dependency to a component of its own.
>

Depending on Spring directly is just a little more convenient and makes the
code a little cleaner (e.g. with Spring's annotations). If we don't, as long
as we design with a Dependency Injection mindset and write our components as
JavaBeans, we can still use Spring or any other DI framework to wire the
components in XML.

The core probably doesn't need DI if it is just the individual components
like the parser, task master etc. For any module that needs to wire things
up, a DI framework will help.
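
To illustrate the "DI mindset without a Spring dependency" point, here is a minimal sketch. All names in it (Parser, UpperCaseParser, Worker, ManualWiring) are made up for illustration and are not the actual Droids interfaces. The component is a plain JavaBean, so it can be wired by hand or by any DI container.

```java
import java.util.Locale;

// Hypothetical component interface, not the real Droids Parser.
interface Parser {
    String parse(String content);
}

// Hypothetical implementation used only to have something to inject.
class UpperCaseParser implements Parser {
    @Override
    public String parse(String content) {
        return content.toUpperCase(Locale.ROOT);
    }
}

// A plain JavaBean: no-arg constructor plus getter/setter, no framework imports.
class Worker {
    private Parser parser;

    public Worker() { }

    public Parser getParser() { return parser; }

    public void setParser(Parser parser) { this.parser = parser; }

    public String process(String content) {
        if (parser == null) {
            throw new IllegalStateException("parser has not been injected");
        }
        return parser.parse(content);
    }
}

// Hand wiring, the way a simple runtime could do it without a DI container.
class ManualWiring {
    static Worker createWorker() {
        Worker worker = new Worker();
        worker.setParser(new UpperCaseParser());
        return worker;
    }
}
```

Because Worker only exposes a setter, a Spring XML file could set the same "parser" property later without the class itself importing anything from Spring.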



>
> > >
> > >         - See the Link-centric design bullet for more info about link
> > >
> > >         - Fetcher
> > >         - we do not use the concept of Fetcher now. I suppose it is
> > >         because Droids is designed to do more than web crawling and
> > >         non-web resource is not "fetched"? In Droids 0.01, Protocol
> > >         basically represent the Fetcher (or at least, Protocol+Worker)
> >
> > +1 -- I think a Fetcher concept is a good idea.  It should also be
> > independent of the task interface.  When thinking about the fetcher,
> > it may be good to consider VFS (http://commons.apache.org/vfs/) as a
> > first class implementation.
>
> I think I get the fetcher concept (getting a representation of the
> underlying task) but not sure how vfs fits the picture.


Sounds easy to implement. There would be a VFSFetcher that takes a single
VFS URL string as configuration and returns a ContentEntity (depending on
the interface, or maybe just an InputStream).

I guess we should at least create the concept of a Fetcher for the crawler
use case (if not in core).
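
A rough sketch of what such a Fetcher could look like, assuming the "maybe just an InputStream" variant. The interface and class names are hypothetical, and UrlFetcher uses java.net.URL as a dependency-free stand-in; a real VFSFetcher would delegate to Commons VFS instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical interface: the thread has not settled on a signature.
interface Fetcher {
    InputStream fetch(String url) throws IOException;
}

// Stand-in for the proposed VFSFetcher, using java.net.URL instead of
// Commons VFS so that the sketch has no external dependency.
class UrlFetcher implements Fetcher {
    @Override
    public InputStream fetch(String url) throws IOException {
        return URI.create(url).toURL().openStream();
    }
}

class Fetchers {
    // Reads the whole stream as UTF-8 text and closes it afterwards.
    // IOException is rethrown unchecked to keep call sites short.
    static String fetchAsString(Fetcher fetcher, String url) {
        try (InputStream in = fetcher.fetch(url)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since the interface has a single method, a test or a non-web use case can supply its own Fetcher as a lambda without touching the crawler code.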



>
> >
> > agree.
> >
> > "Parser" and "Extractor" make sense,
> >
> > "Processor" and "Handler" are not clear to me.  I know they each have
> > functions that can be reused by each other, but the general terms get
> > confusing.
> >
>
> Hmm, http://www.thefreedictionary.com/handler in a parsing concept I
> agree is connected with SAX but in our context it aims to express that
> the task now can be passed over to the final stage.
>
> To get it straight you recommend that we drop handlers and only have
> parser and extractor? The problem I see then is that the parsing stage
> would be mandatory since its merges from there, right?
>


I mainly suggest that we define the concepts clearly, as well as the
interfaces and the calling sequence. Whether it is called a handler or a
parser, or whether we make Parser extend Handler, those are all just names.

In the current snapshot, the Handler interface sounds like a filter or a
listener to me. It has a method that takes a URI and a ContentEntity, and
the parser has a similar interface. I think we need to define its role
clearly. Is there anything that shouldn't be done by a parser but by a
handler? What is the flow / sequence of using the handler? If a handler
works in a similar way to a parser, I suggest merging them.

An event-driven parser like a SAX XML parser also has a concept of
ContentHandler, which has nothing to do with our Handler interface.

btw, extractor is mainly for crawling.

>
> >
> > I will defer to others on the Entity discussion...  I am not really
> > familiar with the concepts
>
> Like said above the entity part of Droids had been implemented by oleg
> which is most familiar with the concept.
>

To put it simply, I just raised the question of what we intend to achieve
through wrapping the HttpEntity from the HC project. I think:

   - imho, the HttpEntity object is mainly for holding the InputStream and
   storing some metadata for it, such as content type, content length and
   encoding.

   - just looking at our entity's API, it provides getMimeType(),
   getCharset() and getParse(). The mimetype is the HTTP content type and
   the charset is the encoding. As for parse, I suppose it stores arbitrary
   data. (btw, I think the concept of Parse should be reviewed, too)

   - so it seems to me the main purpose is just to avoid tying it to HTTP.
   Conceptually that's OK, but people who learn Droids will need to learn
   both Droids' entity and HC's entity. I personally prefer to use HC's
   directly if there is no major added value. The name is not "politically
   correct" for us, but that's no big deal. (esp. as I use Droids for web
   crawling.. :-) )

   - if we make the Link / LinkTask a Map, then we already have a vehicle
   to store arbitrary data. And if we also standardize the interfaces to
   take the Link/LinkTask instead of a URI, we have a place to store any
   data that needs to be associated with the Entity/ContentEntity (which is
   always associated with a Link/LinkTask).




>
> > >         - if we keep TaskMaster, i suggest to make it implements
> > >         ExecutorService, and we depends on Java util/concurrency API
> > >         rather than a TaskMaster interface.
> > >
> >
> > seems good.
>

A component that manages the queue, threads and execution, which is what the
Task Master is designed to be, is needed. I suggest doing it this way:

   - we have the Task Master in the core.
   - for the crawler/web-crawler module, the Crawler may extend the Task
   Master if suitable (rather than the Crawler having a Task Master).
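
As a sketch of the two bullets above, assuming the Task Master is built on java.util.concurrent: the class names and the "visited" bookkeeping are hypothetical, and a real crawler would fetch and parse inside the task instead of only recording the URL.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical core class: a TaskMaster that is itself an ExecutorService
// (ThreadPoolExecutor implements that interface).
class TaskMaster extends ThreadPoolExecutor {
    TaskMaster(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS,
              new LinkedBlockingQueue<Runnable>());
    }
}

// Hypothetical crawler-module class: the Crawler extends the TaskMaster
// rather than holding one.
class Crawler extends TaskMaster {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    Crawler(int threads) {
        super(threads);
    }

    // Queues one task per URL; here the task only records the URL.
    void crawl(List<String> urls) {
        for (String url : urls) {
            execute(() -> visited.add(url));
        }
    }

    // Stops accepting work, waits for queued tasks, returns the visited URLs.
    Set<String> finish() {
        shutdown();
        try {
            if (!awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return visited;
    }
}
```

Callers that only need to submit work can hold the crawler as a plain ExecutorService, which is the point of depending on the standard API rather than on a TaskMaster interface.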




>
> >
> >
> > >         - Queue
> > >         - I suggest to remove the TaskQueue interface and use Queue<?
> > >         extends Link> as standard signature.
> > >
> >
> > +1
>
> +1
>

Basically, many of my proposals are just to remove or make optional some
interfaces in Droids, so that when people pick up Droids, they have less to
learn. TaskQueue -> Queue and TaskMaster -> ExecutorService are two examples.
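
A small sketch of the TaskQueue -> Queue idea, with a hypothetical minimal Link class. One thing to keep in mind with the Queue<? extends Link> signature: Java's wildcard rules let code holding such a reference poll from it but not add to it, so the producer side needs a concrete element type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical minimal Link; the real class would carry more than a URL.
class Link {
    private final String url;

    Link(String url) {
        this.url = url;
    }

    String getUrl() {
        return url;
    }
}

class QueueSketch {
    // Consumer side: accepts a queue of Link or of any Link subtype.
    // The wildcard allows polling but not adding.
    static List<String> drain(Queue<? extends Link> queue) {
        List<String> urls = new ArrayList<>();
        Link next;
        while ((next = queue.poll()) != null) {
            urls.add(next.getUrl());
        }
        return urls;
    }

    // Producer side: needs a concrete element type to be able to add.
    static Queue<Link> seed(String... urls) {
        Queue<Link> queue = new ArrayDeque<>();
        for (String url : urls) {
            queue.add(new Link(url));
        }
        return queue;
    }
}
```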




>
> >
> >
> >
> > >         8. Removed concepts based on the above proposal
> > >   - LinkTask, Task, just keep Link
>
> I would prefer to keep task since it is more generic.


how about this, we have:

   - Task in core
      - has an "id" field (a Long?)
      - extends HashMap
      - implements Serializable
   - Link in crawler module
      - extends Task
      - has a "url" field as a String (or URI)
      - re the name: people who read the interface know it's a Task anyway.
      Calling it LinkTask as it is now is also OK, but people will ask what
      exactly the Task in the LinkTask name means. For crawling, we may use
      just the URL for the queue, and Task is something generic that becomes
      vague in the context of a crawler.
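
The proposed hierarchy could be sketched like this. The String/Object type parameters, the long id and the accessor names are assumptions, as the proposal leaves them open.

```java
import java.io.Serializable;
import java.util.HashMap;

// Core module (hypothetical sketch): a Task is a map of arbitrary data
// plus an id. HashMap is already Serializable; it is restated here only
// to mirror the proposal.
class Task extends HashMap<String, Object> implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long id;

    Task(long id) {
        this.id = id;
    }

    long getId() {
        return id;
    }
}

// Crawler module (hypothetical sketch): a Link is a Task with a URL.
class Link extends Task {
    private static final long serialVersionUID = 1L;

    private final String url;

    Link(long id, String url) {
        super(id);
        this.url = url;
    }

    String getUrl() {
        return url;
    }
}
```

One caveat with extending HashMap: equals() and hashCode() are inherited from the map, so two tasks with the same entries compare equal even if their id or url differ, unless those methods are overridden.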


It's understandable that the naming of components in a generic robot
framework and in a crawler framework is different.


> >
> > >
> > > any comment?
> >
> >
> >
> > In general sounds good.  As for flushing out large changes like this
> > -- i think we should discuss it a bit more to make sure everyone is on
> > the same page.  Then it probably makes sense to start a personal
> > sandbox:
> > http://svn.apache.org/repos/asf/incubator/droids/sandbox/mingfa
> > where we can see some things in action and then look at migrating
> > things together.
>
>
> Yeah this way we can as well promote the proven code to the main tree.
>
> Thanks for the feedback.



I've put some code in svn already. As I doubt anyone would be interested in
running it, I treat it as an example to illustrate my suggestions. It's
better just to read the JavaDoc. (
http://people.apache.org/~mingfai/javadoc/ the package is wrong, just ignore
it)

Thanks for the comments.

regards,
mingfai
