Re: some proposed ideas for Droids

Thorsten Scherler Thu, 09 Jul 2009 02:35:51 -0700

On Mon, 2009-06-22 at 17:23 -0400, Ryan McKinley wrote:
> 
> For background, I am not (yet) using droids for web crawling --  
> rather, I use it to manage a bunch of jobs that keep external  
> processes running.  It is easy to equate droids with crawling, but I  
> think that is one of many functions (though obviously the most  
> generally relevant)
> 
> 
> On Jun 22, 2009, at 11:40 AM, Mingfai wrote:
> 
> > hi,
> >
> > I have some proposed ideas for discussion. Some of them are design
> > principles or concepts, and some are more concrete points about the
> > design of specific items. The points are as follows:
> >
> > * - indicate items that I consider as major changes
> > **
> >
> >   1. Use of Java Package
> >      - do not use a org.apache.droids.api package. Just put the API
> >      interface in individual package.
> >
> 
> +0  (sure... I have no strong opinion)


+1

> 
> 
> >      - do not use *.impl unless there are 3 or more. (i don't really  
> > have
> >      strong opinion at this point, just don't like to have some impl
> > package with
> >      just one or two class(es))
> >
> 
> +0

+/- 0

> 
> >      2. General coding practice/standard
> >      - use protected instead of private by default. This will allow  
> > users
> >      to extend and replace any of our class easier.
> >
> >      - do not use final for LinkTask/Link. I have a use case that I  
> > want to
> >      extend Link/LinkTask as a JPA Entity Bean. And JPA doesn't  
> > allow a class
> >      without default constructor, i.e. we can't final the URL field.  
> > For the
> >      classes that are not expected to be subclassed by users, it's  
> > ok to use
> >      final.
> >
> >      - Use Java standard interface rather than introducing our own
> >      interface unless there are obvious value-added. e.g. replace
> > TaskQueue with
> >      java.util.Queue<Task>
> >
> 
> +1

+1
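
To make the Queue point a bit more concrete, here is a minimal sketch of
how a plain java.util.Queue could stand in for TaskQueue. The Task
interface and the SimpleTask class are hypothetical stand-ins with only an
id, not the actual Droids types.

```java
import java.util.Queue;

// Hypothetical stand-in for the Droids Task interface; only an id is assumed.
interface Task {
    String getId();
}

// Hypothetical minimal implementation, for illustration only.
class SimpleTask implements Task {
    private final String id;

    SimpleTask(String id) {
        this.id = id;
    }

    public String getId() {
        return id;
    }
}

class QueueSketch {
    // Drains the queue in FIFO order and returns the ids joined by commas.
    static String drain(Queue<Task> queue) {
        StringBuilder out = new StringBuilder();
        Task task;
        while ((task = queue.poll()) != null) {
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(task.getId());
        }
        return out.toString();
    }
}
```

Any Queue implementation from java.util or java.util.concurrent can be
passed in, which is the main gain over a custom interface.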

> 
> 
> >      - *Use Spring for droids-core
> >   - i.e. allow droids-core depends on Spring
> >
> >         - a droid needs a whole object graph to work. as a  
> > framework, we
> >         want different components be configurable. It's better to
> > rely on an IoC
> >         framework to manage the dependency and configuration. Spring
> > is the most
> >         popular IoC framework. it will also make testing much easier.
> > and user needs
> >         not to change code (but to change xml) if they want to  
> > change certain
> >         behavior of our core classes.
> >
> >         - there are some special benefits of using spring, e.g. it  
> > supports
> >         annotation and autowiring. Take an example, I could define a
> > field like
> >         "@Autowired Collection<Filter> filters", and when I add a new
> > filter, then,
> >         i create 2 classes like "@Component public class XXXFilter",
> > and in runtime,
> >         the filters field will be injected with a collection of the 2
> > classes. It
> >         makes development and configuration real simple. (and there
> > are also ways to
> >         change the autowiring behavior)
> >
> >         - Spring facilitates the use of http remoting. And it is  
> > easy to
> >         replace implementation class to do remoting or other  
> > interception.
> >
> 
> Does the *core* really need access to the whole object graph?  I  
> totally agree that most specific implementations will need broader  
> access.
> 
> I think droids power will come from its flexibility / simplicity.   
> Ideally the *core* will have as few dependencies as possible.
> 
> I agree that sub-project/package that focuses on web crawling could  
> depend on spring.

I remember that we once had a dependency on Spring in the core.
However, Oleg has shown with the simpleRuntime that we do not need to
rely on Spring in the core. I agree with Ryan and would like to
"outsource" this dependency to a component of its own.

> 
> 
> 
> >         3. Specific concept/component. Some of the points in this  
> > section
> >   are just my comment to the concept, but not any proposal for action.
> >   - Droids/Crawler
> >         - Our top level concept. We only use Droid but not Crawler.  
> > I use
> >         the generic term Crawler in this message.
> >
> 
> Ya -- the term "droid" was intentionally chosen so that it represents  
> the larger concept of a robot doing something.  Crawling is one  
> instance of what it may do.  (again, likely the most broadly used)

Yup. In my use case I am using one droid to execute a fixed list of tasks
and another to crawl a certain page.

> 
> 
> >         - Link, LinkTask, Task
> >      - Task is a valid concept. A task is the unit that work by the  
> > worker.
> >         A link refer to the link only. I have no objection to this
> > concept. But in
> >         implementation, it seems there is no much need to implement  
> > the Task
> >         concept. (naming an interface as "nextTask" is ok but it
> > seems no need to
> >         have a class or interface called "Task")
> >
> >         - a crawler works with links, and we don't normally have non-link
> >         related tasks; those go beyond the scope of a crawler framework.

Droids wants to support non-crawler use cases as well, so that the
different components can be reused.

> >
> >         - See the Link-centric design bullet for more info about link
> >
> >         - Fetcher
> >         - we do not use the concept of Fetcher now. I suppose it is  
> > because
> >         Droids is designed to do more than web crawling and non-web
> > resource is not
> >         "fetched"? In Droids 0.01, Protocol basically represent the
> > Fetcher (or at
> >         least, Protocol+Worker)
> 
> +1 -- I think a Fetcher concept is a good idea.  It should also be  
> independent of the task interface.  When thinking about the fetcher,  
> it may be good to consider VFS (http://commons.apache.org/vfs/) as a  
> first class implementation.

I think I get the fetcher concept (getting a representation of the
underlying task) but I am not sure how VFS fits into the picture.

> 
> 
> >
> >         - I strongly think we should use the term and concept of  
> > Fetcher
> >         because it is a common terminology in crawler. Using common  
> > terms and
> >         language makes our design more intuitive.
> >
> >         - Parser, Handler, Processor, Extractor etc. these are terms  
> > that
> >      share very similar meaning. No matter how we use them, we need  
> > to give a
> >      strict definition, e.g. in class level JavaDoc comment. My  
> > suggestions:
> >         - Parser - the component that process the raw fetched Entity.
> >         Output data is subject to implementation. One Entity will be
> > parsed by one
> >         parser only.
> >
> >         - Extractor - the component to extract out link from entity.
> >         Multiple extractors could be used for a parser. the primary
> > function is to
> >         extract out link. user may also use it to do other extraction
> > or operation,
> >         e.g. to store data in the Link, or just consume the parsed
> > data. An extractor
> >         depends directly on a parser. (we can't easily define a
> > contract between
> >         Parser and Extractor, so let's not attempt this.)
> >
> >         Extractor is a new concept. It is split from Parser and
> > differs in
> >         a way that each link shall be parsed once, and multiple  
> > extractor may
> >         perform extraction or custom operation against the parsed
> > data. I think i
> >         mentioned in another email before. Say when we use
> > NekoHtmlParser, we want
> >         to parse just once, and maybe extractor1 is for extracting
> > outlink, and
> >         extractor2 is for custom processing (such as indexing) and
> > both are based on
> >         the same parsed data.
> >
> >         - Processor - too vague. do not use this concept, and we are  
> > not
> >         using it anyway.
> >
> >         - Handler - for event based Parser, it may use a SAX
> >         DefaultHandler. To avoid confusion, let's not to use handler  
> > in other
> >         context.
> 
> agree.
> 
> "Parser" and "Extractor" make sense,
> 
> "Processor" and "Handler" are not clear to me.  I know they each have  
> functions that can be reused by each other, but the general terms get  
> confusing.
> 

Hmm, see http://www.thefreedictionary.com/handler. In a parsing context I
agree that "handler" is connected with SAX, but in our context it aims to
express that the task can now be passed over to the final stage.

To get it straight: you recommend that we drop handlers and only have
parser and extractor? The problem I see then is that the parsing stage
would be mandatory, since everything emerges from there, right?
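
As far as I understand the proposed split, it would look roughly like the
sketch below: the content is parsed once and several extractors work on
the same parsed result. All names (Parser, Extractor, WordParser, ...) are
hypothetical and only illustrate the idea; the toy parser just splits text
on whitespace instead of using something like NekoHtmlParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical contract: turns raw content into some parsed form P.
interface Parser<P> {
    P parse(String rawContent);
}

// Hypothetical contract: works on already parsed data, never on raw content.
interface Extractor<P> {
    List<String> extract(P parsed);
}

// Toy parser that splits on whitespace and counts how often it is invoked.
class WordParser implements Parser<List<String>> {
    int parseCalls = 0;

    public List<String> parse(String rawContent) {
        parseCalls++;
        return Arrays.asList(rawContent.trim().split("\\s+"));
    }
}

// Extractor 1: keeps the tokens that look like http links (outlinks).
class LinkExtractor implements Extractor<List<String>> {
    public List<String> extract(List<String> parsed) {
        List<String> links = new ArrayList<String>();
        for (String token : parsed) {
            if (token.startsWith("http://")) {
                links.add(token);
            }
        }
        return links;
    }
}

// Extractor 2: custom processing, here simply the non-link tokens.
class TextExtractor implements Extractor<List<String>> {
    public List<String> extract(List<String> parsed) {
        List<String> words = new ArrayList<String>();
        for (String token : parsed) {
            if (!token.startsWith("http://")) {
                words.add(token);
            }
        }
        return words;
    }
}

class ParseOnceSketch {
    // Parses once, then hands the same parsed data to every extractor.
    static <P> List<List<String>> run(Parser<P> parser,
                                      List<Extractor<P>> extractors,
                                      String raw) {
        P parsed = parser.parse(raw);
        List<List<String>> results = new ArrayList<List<String>>();
        for (Extractor<P> extractor : extractors) {
            results.add(extractor.extract(parsed));
        }
        return results;
    }
}
```

In this shape the parse step is indeed mandatory, because the extractors
have nothing else to work on.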

> >
> >         - Entity
> >         - I don't have any strong proposal and this section is just to
> >         brainstorm some ideas. It would be good to clarify what we
> > want to achieve
> >         in providing an Entity hierarchy, given that, the HC project  
> > actually
> >         provides all the Entity already. Our entity is kind of a  
> > wrapper with
> >         buffer.(and HC also have buffered entity) I guess we don't
> > want to depend on
> >         HttpEntity from HC directly as Droids may touch entity beyond
> > HttpEntity,
> >         e.g. File (but HC also as FileEntity...)

This part has been implemented by Oleg directly.

> >
> >         - Entity is the contract between Fetcher/Protocol and  
> > Parser. For
> >         Entity, it's unlikely the user need to subclass it. if they
> > need to subclass
> >         it, they also need to implement a Fetcher/Protocol + Parser.
> > The value of
> >         sub-classing Entity is not significant. I suggest we just not
> > design it for
> >         subclassing.
> >
> >         - Currently, we have a hierarchy of ManagedContentEntity,
> >         ContentEntity, FileContentEntity, HttpContentEntity. For a
> > file parser and a
> >         http parser, they can't easily use a common Entity interface
> > and I suppose
> >         the parser implementation has to cast the entity. To me,
> >         "ManagedContentEntity" doesn't give a lot of meaning than  
> > "Entity".
> >         FileEntity and HttpEntity does make a different to me in
> > concept, but i
> >         don't see how they could be related in implementation.
> >
> >         - My initial thought about the contract of parse is whether  
> > it can
> >         just take a InputStream. And later i find it is necessary to
> > have a concept
> >         of Entity that hold information like content/mime type,
> > encoding/charset,
> >         size/length etc. But different kinds of entity may just have
> > different attributes
> >         and it's not easy to define a common contract. One of the
> > ideas in my mind
> >         is to use a single final Entity class that extend HashMap.
> >
> >         - For HttpEntity, i do prefer to have a way to retrieve the
> >         original HC HttpEntity object. (but it is unlikely we want to
> > expose that in
> >         any interface) Notice that the wrapping make it a bit more  
> > complex in
> >         constructing instance in unit test.
> >
> 
> 
> I will defer to others on the Entity discussion...  I am not really  
> familiar with the concepts

As said above, the entity part of Droids has been implemented by Oleg,
who is most familiar with the concept.


> 
> >         - Worker, Task, TaskMaster
> >         - Make worker implement Runnable, Future (and not
> > RunnableFuture
> >         for JDK5 compatibility) and we use run() as its main
> > interface. So it could
> >         be used as a thread more easily.
> >
> 
> sure

okay
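
One way this could look, as a rough sketch only: the worker implements
Runnable and Future and delegates everything to a java.util.concurrent
FutureTask. The Worker class and its constructor taking a Callable are
assumptions for illustration, not the existing Droids Worker.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical worker: Runnable so it can be handed to a Thread or Executor,
// Future so callers can wait for (or cancel) its result.
class Worker<V> implements Runnable, Future<V> {
    // FutureTask already provides the state handling; we only delegate.
    private final FutureTask<V> delegate;

    Worker(Callable<V> work) {
        this.delegate = new FutureTask<V>(work);
    }

    public void run() {
        delegate.run();
    }

    public boolean cancel(boolean mayInterruptIfRunning) {
        return delegate.cancel(mayInterruptIfRunning);
    }

    public boolean isCancelled() {
        return delegate.isCancelled();
    }

    public boolean isDone() {
        return delegate.isDone();
    }

    public V get() throws InterruptedException, ExecutionException {
        return delegate.get();
    }

    public V get(long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        return delegate.get(timeout, unit);
    }
}
```

Such a worker can be started with new Thread(worker).start() and the
caller can then block on worker.get().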

> 
> >         - I suggest to remove the concept of Task and TaskMaster. A
> >         Droid/Crawler could do most work of the TaskMaster. These
> > concepts also
> >         confuse with Thread, ThreadFactory, Executor, that creates
> > many similar
> >         concepts.
> >
> 
> maybe -- right now, the Droid interface just handles initialization  
> and callbacks from the TaskMaster.  It seems like that is a  
> substantially different concept then keeping a bunch of processes  
> running tasks.
> 

See my comments in the other mail.

> >         - if we keep TaskMaster, i suggest to make it implements
> >         ExecutorService, and we depends on Java util/concurrency API
> > rather than a
> >         TaskMaster interface.
> >
> 
> seems good.
> 
> 
> >         - Queue
> >         - I suggest to remove the TaskQueue interface and use Queue<?
> >         extends Link> as standard signature.
> >
> 
> +1

+1

> 
> 
> >         4. Link-centric design
> >   - Link, extends HashMap, will act as a main arbitrary data
> > container, and
> >      a vehicle that stores attributes and data throughout the whole
> > lifecycle of
> >      fetching, parsing, and extracting.
> 
> I don't have any strong opinion here.... but I would rather see an API  
> where we can rely on method calls then putting stuff into a Map --  
> perhaps years of dealing with request.getAttribute() has turned me  
> sour on this model.
> 
> >
> >      - if we do it extremely, all data can be stored as in the Link  
> > and all
> >      interface could just use a single Link argument, e.g.  
> > parse(Link),
> >      extract(Link). For sure it is not a good idea. So i make every
> > interface to
> >      include the Link argument as well as key component. I found the  
> > extreme
> >      usage is good in remote web service api, but not good in Java  
> > API.
> >
> >      - All components to be generic as <? extends Link>, user may use
> >      another Link implementation for the whole Droid operation. an
> > example is a
> >      WeightedLink.
> >
> >      - For a Link, only the URL is mandatory. A ID is needed for
> >      implementing an in-memory set/hashtable to reject duplicated
> > Link quickly. I
> >      suggest to make Link a class so people could create a link with
> >      new Link("http://www.apache.org") easily, just like creating URL or URI.

sounds reasonable. 
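
A minimal sketch of how I read this, assuming the URL doubles as the ID.
The Link and SeenLinks classes below are hypothetical and not the current
Droids Link.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

// Hypothetical Link: the URL is the only mandatory part, everything else
// is an optional attribute stored in the inherited map.
class Link extends HashMap<String, Object> {
    private static final long serialVersionUID = 1L;

    private String url;

    Link(String url) {
        if (url == null || url.length() == 0) {
            throw new IllegalArgumentException("url is mandatory");
        }
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    // Assumption: the URL itself serves as the ID used to spot duplicates.
    String getId() {
        return url;
    }
}

// In-memory set of IDs to reject duplicated links quickly.
class SeenLinks {
    private final Set<String> seenIds = new HashSet<String>();

    // Returns true the first time a link ID is offered, false for duplicates.
    boolean firstVisit(Link link) {
        return seenIds.add(link.getId());
    }
}
```

Note that the equals() and hashCode() inherited from HashMap only look at
the stored attributes, which is why the sketch tracks the ID and not the
Link object itself.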

> >      5. Non-thread safe interface and fluent API
> >   - take an example, instead of "Parse parse()", i suggest the
> > parser to
> >      store the parsed data inside itself, and we provide a reset()  
> > method to
> >      clear the data for re-use. This design has pro and con.
> >
> >      - one of main pro is, we could simplify the model by omitting a  
> > Parse
> >      class that is mainly for holding arbitrary data. And we also  
> > can't easily
> >      define the return type of an interface. Take Fetcher as an  
> > example, a
> >      fetcher typically contain a Request and Response object. Should  
> > we have
> >      fetch() to return a FetchedData that has request, response, and
> > entity? it's
> >      just a bit complex.
> >
> >      - I hope no one against Fluent API :-). with fluent api, it's  
> > like
> >      "public Fetcher fetch()". And I don't always use Fluent Api,
> > only use when
> >      it is good and the api call may be chained. e.g.  
> > parser.parse().getDate()
> >

No strong feeling about it.

> >      6. Factory and LinkMatcher design
> >      - For worker, fetcher and parser, they are provided by users as a
> >      Factory.
> >
> >      - For FetcherFactory and ParserFactory, new instance are  
> > created with
> >      a newXXX(Link link)
> >
> >      - So, depends on the Link, the Factory will provide diff  
> > components.
> >      e.g. for http link, it's a HttpFetcher, for File, it's a file
> > fetcher (not
> >      impl.) for parser, it consults the content type.
> >
> >      - Every component implements a LinkMatcher interface that
> > checks if a
> >      Fetcher/Parser/Extractor supports a particular link. This is
> > primary for
> >      automatic component registration without a need to explicitly  
> > providing a
> >      mapping upfront. e.g. there might be a PNG parser that checks the
> >      "contentType" attribute of the Link. The parser implemented the  
> > matches()
> >      method. so we don't need to maintain a mapping hashmap between
> > contentType
> >      and parser. The link matching may be complex so it's hard to use
> > a mapping
> >      hashmap anyway. together with the filter framework, any
> > attribute could be
> >      prepared by a filter first, so the factory could always rely on
> > the matcher
> >      interface to find the correct parser/fetcher.
> >
> 
> no real opinion -- everything sounds reasonable.

ditto
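
For reference, a small sketch of the matcher idea as I understand it: the
factory keeps no contentType-to-parser map and simply asks each registered
parser whether it matches. All names are hypothetical, the link is
modelled as a plain Map of attributes, and for brevity the factory returns
the registered instance instead of creating a new one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contract: a component says for itself whether it supports a link.
interface LinkMatcher {
    boolean matches(Map<String, Object> link);
}

// Hypothetical parser contract; name() only exists to keep the example observable.
interface Parser extends LinkMatcher {
    String name();
}

class PngParser implements Parser {
    public boolean matches(Map<String, Object> link) {
        return "image/png".equals(link.get("contentType"));
    }

    public String name() {
        return "png";
    }
}

class HtmlParser implements Parser {
    // Matching may be more involved than a map lookup, e.g. a prefix check.
    public boolean matches(Map<String, Object> link) {
        Object type = link.get("contentType");
        return type != null && type.toString().startsWith("text/html");
    }

    public String name() {
        return "html";
    }
}

class ParserFactory {
    private final List<Parser> parsers = new ArrayList<Parser>();

    void register(Parser parser) {
        parsers.add(parser);
    }

    // Returns the first registered parser that matches, or null if none does.
    Parser newParser(Map<String, Object> link) {
        for (Parser parser : parsers) {
            if (parser.matches(link)) {
                return parser;
            }
        }
        return null;
    }
}
```

With this, adding a new parser only means registering it; nothing else
has to be updated.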


> 
> >      7. *Filter Framework
> >      - This is a significant new concept. I propose to have a filter
> >      framework that works as a chain for intercepting the works of  
> > every main
> >      component. There are a main lifecycle filter that is named
> > Filter, and also
> >      individual component filters such as FetchFilter and
> > ParseFilter. Lifecycle
> >      filter is called by a Worker. Some works may not support it,  
> > e.g. my
> >      WebServiceWorker that call GAE service do the whole batch of
> >      fetch->parser->extract in one go, so there is no local filter.
> > Normal worker
> >      shall call every filter after every operation. If the filter
> > return null, it
> >      stop processing the link
> >
> >      - Filter
> >         - When we have a confirmed lifecycle, e.g. poll a link from  
> > queue
> >         -> fetch entity -> parse entity -> extract outlinks, then we
> > have a filter
> >         that allows interception between every stage, e.g.
> >         public Link polled(Link link)
> >         public Fetcher fetched(Link link, Fetcher fetcher)
> >
> >         - any filter may influence the flow by changing the component
> >         object like Fetcher/Parser, or they may return null and the
> >         Worker/TaskMaster shall stop the process for that link.
> >         - e.g. Duplicated Link handling could be done as a Filter. a
> >         singleton NoRepeatFilter stores a Set of Link ID, and when  
> > any link is
> >         extracted, it is checked against the set and duplicated links
> > will be removed.
> >         - It offers a lot of potential, such as providing runtime
> >         statistics.
> >
> >         - Component filter
> >         - e.g. FetchFilter, public void preFetch(Link, Fetcher),
> >         postFetch(Link,Fetcher)
> >         - component filter is expected to alter fetching behavior.  
> > e.g. for
> >         preFetch for a http fetcher, the http request shall be
> > available already,
> >         and the preFetch could cast the fetcher to the concrete class,
> > and modify the
> >         content of the HttpRequest before it is executed. e.g. to
> > append http header
> >         / cookie depends on any attribute in the Link.
> >         - The global / lifecycle filter do filter after every  
> > component
> >         operations. they are designed for different purpose.
> >
> 
> sounds good


agree
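
To check my understanding of the lifecycle filter, a rough sketch of a
chain where returning null stops the processing of a link. Only the
polled() hook from the proposal is shown, a link is modelled as a plain
URL string, and the class names besides NoRepeatFilter are made up for
the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical lifecycle filter with a single hook.
interface Filter {
    // Returns the (possibly modified) link, or null to stop processing it.
    String polled(String link);
}

// Rejects any link that has been seen before. Not thread-safe as written.
class NoRepeatFilter implements Filter {
    private final Set<String> seen = new HashSet<String>();

    public String polled(String link) {
        return seen.add(link) ? link : null;
    }
}

// Only lets http links through.
class HttpOnlyFilter implements Filter {
    public String polled(String link) {
        return link.startsWith("http://") ? link : null;
    }
}

class FilterChain {
    private final List<Filter> filters = new ArrayList<Filter>();

    FilterChain add(Filter filter) {
        filters.add(filter);
        return this;
    }

    // Runs the filters in order; the first null result ends the chain.
    String polled(String link) {
        String current = link;
        for (Filter filter : filters) {
            current = filter.polled(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

A worker would call chain.polled(link) after taking a link from the queue
and skip the link when the result is null.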

> 
> 
> 
> >         8. Removed concepts based on the above proposal
> >   - LinkTask, Task, just keep Link

I would prefer to keep task since it is more generic.

> >      - TaskMaster - with some refactor to assign responsibility to  
> > Droids
> >      and Worker, the TaskMaster doesn't do too many things, and I  
> > suggest to
> >      remove this concept.

Hmm, see my other comments about the TaskMaster, but IMO the TaskMaster
concept is generic and valuable.

> >      - TaskQueue - just use Queue<Link>

+1

> >      - TaskQueueWithHistory - this is eliminated by the filter  
> > framework.
> >      See the next section.
> >      - TaskValidator - eliminate by the filter framework / implement  
> > as a
> >      Filter
> >      - URL Filter - could be implemented as a Filter
> >      - Parse(Parse) - merged into Parser, I think "Parse" is a vague
> >      concept and we would rather to have a Map return from than have  
> > a Parse
> >      class
> >
> 
> ya.
> 
> >
> > any comment?
> 
> 
> 
> In general sounds good.  As for flushing out large changes like this  
> -- i think we should discuss it a bit more to make sure everyone is on  
> the same page.  Then it probably makes sense to start a personal  
> sandbox:
> http://svn.apache.org/repos/asf/incubator/droids/sandbox/mingfa
> where we can see some things in action and then look at migrating  
> things together.


Yeah, this way we can also promote the proven code to the main tree.

Thanks for the feedback.

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)
