>
>
>    -
>
>       1. Filter Framework
>       - This is a significant new concept. I propose a filter framework
>       that works as a chain for intercepting the work of every main
>       component. There is a main lifecycle filter named Filter, and also
>       individual component filters such as FetchFilter and ParseFilter.
>       The lifecycle filter is called by a Worker. Some workers may not
>       support it, e.g. my WebServiceWorker, which calls a GAE service, does
>       the whole batch of fetch->parse->extract in one go, so there is no
>       local filter. A normal worker shall call every filter after every
>       operation. If a filter returns null, the worker stops processing
>       the link.
>
>       - Filter
>          - When we have a confirmed lifecycle, e.g. poll a link from queue
>          -> fetch entity -> parse entity -> extract outlinks, then we have a
>          filter that allows interception between every stage, e.g.
>          public Link polled(Link link)
>          public Fetcher fetched(Link link, Fetcher fetcher)
>
>          - Any filter may influence the flow by changing the component
>          object like Fetcher/Parser, or it may return null and the
>          Worker/TaskMaster shall stop the process for that link.
>          - e.g. duplicate link handling could be done as a Filter. A
>          singleton NoRepeatFilter stores a Set of Link IDs, and when any
>          link is extracted, it is checked against the set and duplicate
>          links are removed.
>          - It offers a lot of potential, such as providing runtime
>          statistics.
>
>          - Component filter
>          - e.g. FetchFilter: public void preFetch(Link, Fetcher),
>          postFetch(Link, Fetcher)
>          - A component filter is expected to alter fetching behavior. e.g.
>          in preFetch for an HTTP fetcher, the HTTP request shall be available
>          already, and preFetch could cast the fetcher to the concrete class
>          and modify the content of the HttpRequest before it is executed,
>          e.g. to append an HTTP header / cookie depending on any attribute
>          in the Link.
>          - The global / lifecycle filter filters after every component
>          operation. The two kinds are designed for different purposes.
>
>
>
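For reference, the chain semantics quoted above (a filter returning null stops processing of the link) could be sketched roughly as below. This is only an illustration, not the code in svn: Link is reduced to an ID, the lifecycle Filter is reduced to the single polled() stage, FilterChain is a hypothetical helper name, and hooking NoRepeatFilter into polled() rather than into the extract stage is my simplification.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical minimal Link; only the ID matters for this sketch.
final class Link {
    private final String id;
    Link(String id) { this.id = id; }
    String getId() { return id; }
}

// Lifecycle filter, reduced to one stage for brevity.
interface Filter {
    // Return the (possibly modified) link, or null to stop processing it.
    Link polled(Link link);
}

// Rejects any link whose ID has been seen before.
final class NoRepeatFilter implements Filter {
    private final Set<String> seen = new HashSet<>();

    @Override
    public synchronized Link polled(Link link) {
        // Set.add returns false when the ID is already present.
        return seen.add(link.getId()) ? link : null;
    }
}

// Hypothetical helper that a Worker could use to run the chain.
final class FilterChain {
    // Runs the filters in order and returns null as soon as one rejects the link.
    static Link applyPolled(List<Filter> filters, Link link) {
        Link current = link;
        for (Filter f : filters) {
            current = f.polled(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

A Worker would call FilterChain.applyPolled(filters, link) after polling and skip the link when the result is null.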
The proposed Filter framework has been significantly revised. A diagram is
worth a thousand words:
http://people.apache.org/~mingfai/model/Filter%20Class%20Diagram.png
http://people.apache.org/~mingfai/model/Filter%20Sequence%20Diagram.png
(note that the diagrams may change at any time)

Additional notes:

   - Every component has an abstract implementation, e.g. AbstractParser,
   that implements FilterSupport and provides a method to call every
   registered filter.

   - Filter configuration could be done by setting a List of filters on the
   component, or, if we can depend on Spring, the abstract implementation could
   be defined to use @Autowired List<XXXFilter> filters; so users just need to
   annotate their Filter (and include it in the Spring component scan) to
   register a filter.

   - Filters are sensitive to order. A mechanism is provided to sort the
   filters, e.g. for AbstractFetcher:
      - it has an optional autowired comparator:
      @Autowired @Qualifier("fetcher.filterComparator") protected Comparator
      filterComparator
      - it has a @PostConstruct init() method that checks whether a
      filterComparator is set and, if so, sorts the filters List
      - a WeightComparator is provided and I expect most people will just
      use that. The WeightComparator simply checks whether the two objects
      implement Weighted (which has a getWeight():int) and uses the weights
      to decide the order.
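
The ordering notes above could look roughly like the following without Spring. Treat it as a sketch under assumptions rather than the code in svn: ascending order by weight and a default weight of 0 for objects that do not implement Weighted are my guesses, FilterSupportSketch and ExampleFilter are made-up names, and init() is called by hand where the real class would rely on @PostConstruct.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

interface Weighted {
    int getWeight();
}

// Orders objects by ascending weight; non-Weighted objects count as weight 0 (assumption).
final class WeightComparator implements Comparator<Object> {
    private static int weightOf(Object o) {
        return (o instanceof Weighted) ? ((Weighted) o).getWeight() : 0;
    }

    @Override
    public int compare(Object a, Object b) {
        return Integer.compare(weightOf(a), weightOf(b));
    }
}

// Stand-in for an abstract component such as AbstractFetcher, minus Spring.
class FilterSupportSketch<F> {
    protected final List<F> filters = new ArrayList<>();
    protected Comparator<Object> filterComparator; // optional, may stay null

    // Plays the role of the @PostConstruct init() method.
    void init() {
        if (filterComparator != null) {
            Collections.sort(filters, filterComparator);
        }
    }
}

// Hypothetical filter used only to show the ordering.
final class ExampleFilter implements Weighted {
    final String name;
    private final int weight;

    ExampleFilter(String name, int weight) {
        this.name = name;
        this.weight = weight;
    }

    @Override
    public int getWeight() { return weight; }
}
```

With no comparator set, init() leaves the filters in registration order, which matches the "optional" comparator described above.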

I've published a JavaDoc:
http://people.apache.org/~mingfai/javadoc/ (also subject to change at any
time)
and the corresponding source code:
https://svn.apache.org/repos/asf/incubator/droids/sandbox/mingfai/src/main/java/org/apache/crawler/

Note that the current code in svn is more for reading than running. I made
quick global changes to the package names and other configuration, and some
things may be broken. It's heavily Spring-based and most test cases are
written in Groovy. If we want to use them directly, it's better to put them
in a sub-module, as the code is not compatible with the current droids-core
(even though the concepts are very similar).


regards,
mingfai
