> 1. Filter Framework
> - This is a significant new concept. I propose a filter framework that
>   works as a chain for intercepting the work of every main component.
>   There is a main lifecycle filter named Filter, and also individual
>   component filters such as FetchFilter and ParseFilter. The lifecycle
>   filter is called by a Worker. Some workers may not support it, e.g. my
>   WebServiceWorker, which calls a GAE service, does the whole batch of
>   fetch->parse->extract in one go, so there is no local filter. A normal
>   worker shall call every filter after every operation. If a filter
>   returns null, the worker stops processing the link.
>
> - Filter
> - When we have a confirmed lifecycle, e.g. poll a link from the queue
>   -> fetch entity -> parse entity -> extract outlinks, then we have a
>   filter that allows interception between every stage, e.g.
>     public Link polled(Link link)
>     public Fetcher fetched(Link link, Fetcher fetcher)
>
> - Any filter may influence the flow by changing the component object,
>   like Fetcher/Parser, or it may return null, in which case the
>   Worker/TaskMaster shall stop the process for that link.
> - E.g. duplicated link handling could be done as a Filter. A singleton
>   NoRepeatFilter stores a Set of Link IDs, and when any link is
>   extracted, it is checked against the set and duplicated links are
>   removed.
> - It offers a lot of potential, such as providing runtime statistics.
>
> - Component filter
> - E.g. FetchFilter: public void preFetch(Link, Fetcher),
>   postFetch(Link, Fetcher)
> - A component filter is expected to alter fetching behavior. E.g. in
>   preFetch for an HTTP fetcher, the HTTP request shall already be
>   available, and preFetch could cast the fetcher to the concrete class
>   and modify the content of the HttpRequest before it is executed, e.g.
>   to append an HTTP header / cookie depending on any attribute in the
>   Link.
> - The global / lifecycle filters run after every component operation.
>   The two kinds of filter are designed for different purposes.
>
>
>
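
For readers skimming the quoted proposal, the lifecycle filter idea can be sketched roughly as below. This is only an illustration of the "return null to stop processing" rule and of the NoRepeatFilter example, not the actual code in svn. The Link class and the applyPolled() helper are hypothetical stand-ins, and the duplicate check is applied at the polled stage just to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LifecycleFilterSketch {

    // Hypothetical stand-in for the crawler's Link type.
    static final class Link {
        final String id;

        Link(String id) {
            this.id = id;
        }
    }

    // Lifecycle filter: return the (possibly modified) link, or null to stop processing it.
    interface Filter {
        Link polled(Link link);
    }

    // Drops any link whose ID has been seen before.
    // Not thread-safe; a shared singleton would need a concurrent set.
    static final class NoRepeatFilter implements Filter {
        private final Set<String> seen = new HashSet<>();

        @Override
        public Link polled(Link link) {
            // Set.add returns false when the ID is already present.
            return seen.add(link.id) ? link : null;
        }
    }

    // What a worker might do after polling a link: run the chain, stop at the first null.
    static Link applyPolled(List<Filter> filters, Link link) {
        Link current = link;
        for (Filter f : filters) {
            current = f.polled(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Filter> filters = new ArrayList<>();
        filters.add(new NoRepeatFilter());
        System.out.println(applyPolled(filters, new Link("a")) != null); // first visit passes
        System.out.println(applyPolled(filters, new Link("a")) != null); // repeat is dropped
    }
}
```

The same loop would apply to the other lifecycle callbacks such as fetched(Link, Fetcher); only the argument types differ.
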
The proposed Filter framework has been significantly revised. A diagram is
worth a thousand words:
http://people.apache.org/~mingfai/model/Filter%20Class%20Diagram.png
http://people.apache.org/~mingfai/model/Filter%20Sequence%20Diagram.png
(note that the diagrams may change at any time)
Additional notes:
- Every component has an abstract implementation, e.g. AbstractParser,
that implements FilterSupport and provides a method to call every
registered filter.
- Filter configuration can be done by setting a List of filters on the
component. Or, if we can depend on Spring, the abstract implementation
could declare @Autowired List<XXXFilter> filters; so users only need to
annotate their Filter (and include it in the Spring component scan) to
register it.
- Filters are sensitive to order, so a mechanism is provided to sort the
filters. E.g. for AbstractFetcher:
  - it has an optional autowired comparator:
    @Autowired @Qualifier("fetcher.filterComparator") protected Comparator
    filterComparator
  - it has a @PostConstruct init() method that checks whether a
    filterComparator is set and, if so, sorts the filters List
  - a WeightComparator is provided and I expect most people will just
    use that. The WeightComparator simply checks whether the two objects
    implement Weighted (which has a getWeight():int) and uses the weight
    to decide the order.
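
To make the ordering notes concrete, here is a minimal plain-Java sketch (no Spring) of the weight-based sort. It is not the code in svn. Two things are my assumptions: objects that do not implement Weighted are treated as weight 0, and lower weights sort first. NamedFilter and the static init() helper are hypothetical and exist only for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WeightSortSketch {

    interface Weighted {
        int getWeight();
    }

    // Assumption: objects that do not implement Weighted are treated as weight 0.
    static final class WeightComparator implements Comparator<Object> {
        @Override
        public int compare(Object a, Object b) {
            return Integer.compare(weightOf(a), weightOf(b));
        }

        private static int weightOf(Object o) {
            return (o instanceof Weighted) ? ((Weighted) o).getWeight() : 0;
        }
    }

    // Hypothetical filter used only for this example.
    static final class NamedFilter implements Weighted {
        final String name;
        final int weight;

        NamedFilter(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }

        @Override
        public int getWeight() {
            return weight;
        }
    }

    // Mirrors the described init(): sort only when a comparator has been provided.
    static void init(List<NamedFilter> filters, Comparator<Object> filterComparator) {
        if (filterComparator != null) {
            filters.sort(filterComparator);
        }
    }

    public static void main(String[] args) {
        List<NamedFilter> filters = new ArrayList<>();
        filters.add(new NamedFilter("cookie", 10));
        filters.add(new NamedFilter("noRepeat", -5));
        filters.add(new NamedFilter("stats", 0));
        init(filters, new WeightComparator());
        for (NamedFilter f : filters) {
            System.out.println(f.name + " " + f.weight);
        }
    }
}
```

List.sort is stable, so filters with equal weight keep their registration order; passing a null comparator leaves the list untouched, which matches the "optional" comparator above.
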
I've published a JavaDoc (also subject to change at any time):
http://people.apache.org/~mingfai/javadoc/
and the corresponding source code:
https://svn.apache.org/repos/asf/incubator/droids/sandbox/mingfai/src/main/java/org/apache/crawler/
Note that the current code in svn is more for reading than running. I made
quick global changes to the package names and other configuration, so some
things may be broken. It's heavily Spring-based and most test cases are
written in Groovy. If we want to use them directly, it's better to put them
in a sub-module, as the code is not compatible with the current droids-core
(even though the concepts are very similar).
regards,
mingfai