Avalonized WebCrawler

David Worms Mon, 27 Jan 2003 15:03:56 -0800

Lucene developers,

This mail follows up on a few threads that took place 2-3 months ago on both the Lucene and Avalon lists:

http://marc.theaimsgroup.com/?l=lucene-dev&m=101518595918785&w=2
http://marc.theaimsgroup.com/?l=avalon-users&m=103706452017829&w=2

They were about porting the WebCrawler app to a component-based application using Avalon. Over the past few days I did just that, and I will be happy to share the code with the community. There is still a lot to do, but my goal was to contact you once the code reached a level of development similar to the one in CVS. I did not contact the list earlier because I wasn't sure where I was going :), and because I do not have CVS access at Apache.

You can download the code @ http://67.116.155.180/~wdavidw/crawler.zip

Both the sources and binaries are present. In my local environment, I use Maven as the build system. The Maven build is not included in the download because some of the jars I used are recent CVS snapshots that are not available from the Maven remote repository (ibiblio.org). If I am not mistaken, all the required libraries are present in the zip file.

Overall, the code behaves just like the current crawler hosted in the Lucene Sandbox repository. Since I mostly refactored this code base, it should be quite easy for the developer(s) to follow what happens. All the comments, methods, etc. remain the same; I only changed the most relevant parts. You will find the code divided into 2 packages: the original package "de.lanlab.*" and the new one "org.crawl.*". The reason for this separation is that every time I created a new component, I moved its code into the second package for clarity.

As the Avalon container, I chose Fortress. It is stable and close to being released (a matter of weeks). I am seriously thinking about Merlin, but it is not a priority for now.

Here is a list of the created components/services:

fetcher-task-factory
host-manager
host-resolver
url-message-factory
web-document-factory
message-handler
message-listener-selector
. url-length-stage
. url-scope-stage
. robot-exclusion-stage
. url-visited-stage
. known-path-stage
. fetcher-stage
storage-pipeline
thread-monitor
fetcher-thread-factory
server-thread-factory
url-normalizer
url-visited-manager
one more to appear: thread-pool-manager
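
To give an idea of how the stages listed under message-listener-selector could chain together, here is a minimal Java sketch. The UrlStage interface, the class names, and the limits are assumptions made up for this illustration; they are not the names or signatures used in the crawler code. Only three of the stages are mimicked, each with a trivial stand-in rule.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stage contract: return true to pass the URL on, false to drop it.
interface UrlStage {
    boolean accept(String url);
}

// Stand-in for url-length-stage: drops URLs longer than a fixed limit.
class UrlLengthStage implements UrlStage {
    private final int maxLength;
    UrlLengthStage(int maxLength) { this.maxLength = maxLength; }
    public boolean accept(String url) { return url.length() <= maxLength; }
}

// Stand-in for url-scope-stage: keeps only URLs that start with a given prefix.
class UrlScopeStage implements UrlStage {
    private final String prefix;
    UrlScopeStage(String prefix) { this.prefix = prefix; }
    public boolean accept(String url) { return url.startsWith(prefix); }
}

// Stand-in for url-visited-stage: drops URLs that were already seen.
class UrlVisitedStage implements UrlStage {
    private final Set<String> visited = new HashSet<>();
    // Set.add returns false when the URL is already in the set.
    public boolean accept(String url) { return visited.add(url); }
}

// Runs a URL through every stage in order and stops at the first rejection.
class StagePipeline {
    private final List<UrlStage> stages = new ArrayList<>();
    StagePipeline add(UrlStage stage) { stages.add(stage); return this; }
    boolean process(String url) {
        for (UrlStage stage : stages) {
            if (!stage.accept(url)) {
                return false;
            }
        }
        return true;
    }
}

class PipelineDemo {
    public static void main(String[] args) {
        StagePipeline pipeline = new StagePipeline()
            .add(new UrlLengthStage(255))
            .add(new UrlScopeStage("http://example.org/"))
            .add(new UrlVisitedStage());
        System.out.println(pipeline.process("http://example.org/a"));   // true
        System.out.println(pipeline.process("http://example.org/a"));   // false, already visited
        System.out.println(pipeline.process("http://other.example/"));  // false, out of scope
    }
}
```

Because a URL only reaches the visited stage after passing the earlier ones, rejected URLs never pollute the visited set in this sketch.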

Configuration:
At this time, every config property is hard-coded in the component class. Integrating a config file will be a fast and easy task because the components already implement the Avalon configuration lifecycle.
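
As an illustration of what moving a hard-coded value into configuration could look like, here is a small Java sketch. In Avalon the hook would be Configurable.configure(Configuration); to keep the sketch runnable without the Avalon jars, it uses java.util.Properties as a stand-in. The class name, the "max-length" key, and the default of 255 are assumptions for the example, not values taken from the crawler.

```java
import java.util.Properties;

// Hypothetical component whose limit would otherwise be a hard-coded constant.
class UrlLengthFilter {
    // Default used when the configuration does not set the property.
    static final int DEFAULT_MAX_LENGTH = 255;

    private int maxLength = DEFAULT_MAX_LENGTH;

    // Stand-in for the Avalon configuration lifecycle method.
    // Reads the hypothetical "max-length" key and keeps the default if absent.
    void configure(Properties config) {
        String value = config.getProperty("max-length");
        if (value != null) {
            maxLength = Integer.parseInt(value.trim());
        }
    }

    boolean accept(String url) {
        return url.length() <= maxLength;
    }

    int getMaxLength() {
        return maxLength;
    }
}

class ConfigDemo {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("max-length", "10");

        UrlLengthFilter filter = new UrlLengthFilter();
        filter.configure(config);
        System.out.println(filter.getMaxLength());             // 10
        System.out.println(filter.accept("short"));            // true
        System.out.println(filter.accept("a-much-longer-url")); // false
    }
}
```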

Logging:
I had a hard time using the Fortress logging service. For now, only two loggers are working: one for the Fortress system, the other for the crawler. Once I understand where the logging issue comes from, each component could have its own logger without any code changes.

Integration:
Fortress can easily be plugged into any type of environment or run as a standalone application. I am planning to write a Phoenix block soon.

Client connection:
The current Observer service will change completely. Instead of printing information to the console, it will export some sort of application state descriptor object via AltRMI or some other mechanism. It will be up to the client to render that information.
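
A state descriptor of this kind could be as simple as the Java sketch below. Everything in it is hypothetical: the CrawlerState class, its three fields, and the renderer are invented for illustration, and the round trip assumes that the chosen remoting layer relies on standard Java serialization, which is not decided yet.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical immutable snapshot of the crawler state; field names are illustrative.
final class CrawlerState implements Serializable {
    private static final long serialVersionUID = 1L;

    final int activeThreads;
    final long documentsFetched;
    final long queuedUrls;

    CrawlerState(int activeThreads, long documentsFetched, long queuedUrls) {
        this.activeThreads = activeThreads;
        this.documentsFetched = documentsFetched;
        this.queuedUrls = queuedUrls;
    }
}

// One possible client-side renderer; the crawler itself prints nothing.
class ConsoleRenderer {
    static String render(CrawlerState s) {
        return "threads=" + s.activeThreads
            + " fetched=" + s.documentsFetched
            + " queued=" + s.queuedUrls;
    }
}

class StateDemo {
    // Serializes and deserializes the snapshot, as a serialization-based
    // remoting layer would do between the crawler and a client.
    static CrawlerState roundTrip(CrawlerState state) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CrawlerState) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("snapshot could not be copied", e);
        }
    }

    public static void main(String[] args) {
        CrawlerState copy = roundTrip(new CrawlerState(4, 120L, 35L));
        System.out.println(ConsoleRenderer.render(copy)); // threads=4 fetched=120 queued=35
    }
}
```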

Speed:
When running the current code against the Avalonized one, I get very similar speed results. The only difference is that the new one takes somewhat longer to reach a stable speed (about 15 seconds).

Avalon:
I kept my use of Avalon simple. For now, I didn't want to use all the tools available. There are a few areas where Avalon could provide more functionality:
- the lifestyle handler (both in Fortress and Merlin), which could replace the use of factories, for example.
- the thread library, which I have not used because I didn't want to change any of the current code.
- the event library, which would reinforce a SEDA architecture.

Javadocs:
None. I kept the ones already present. I will describe every service in more detail soon, once I finish all the refactoring.

Lucene:
I think Lucene should be separated from the crawler. One could easily write a service that schedules crawling processes and exports the results. The results could then be used to create/update a Lucene index.
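
The separation could look roughly like the Java sketch below. All names (CrawledPage, CrawlResultConsumer, CrawlService, InMemoryIndexer) are hypothetical, and a plain map stands in for the Lucene index so that the example runs without Lucene on the classpath. The point is only that the crawler side depends on a small interface, not on Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical exported result of a crawl: a URL and its extracted text.
final class CrawledPage {
    final String url;
    final String text;
    CrawledPage(String url, String text) {
        this.url = url;
        this.text = text;
    }
}

// Hypothetical contract between the crawler and whatever consumes its results.
interface CrawlResultConsumer {
    void consume(List<CrawledPage> pages);
}

// Stand-in for a Lucene-backed indexer; a map plays the role of the index.
class InMemoryIndexer implements CrawlResultConsumer {
    private final Map<String, String> index = new LinkedHashMap<>();

    public void consume(List<CrawledPage> pages) {
        for (CrawledPage page : pages) {
            index.put(page.url, page.text); // creates the entry or updates it
        }
    }

    int size() { return index.size(); }
    String textFor(String url) { return index.get(url); }
}

// Hypothetical service that hands the output of one crawl run to a consumer.
class CrawlService {
    private final CrawlResultConsumer consumer;
    CrawlService(CrawlResultConsumer consumer) { this.consumer = consumer; }

    void publish(List<CrawledPage> crawlOutput) {
        consumer.consume(crawlOutput);
    }
}

class SeparationDemo {
    public static void main(String[] args) {
        InMemoryIndexer indexer = new InMemoryIndexer();
        CrawlService service = new CrawlService(indexer);

        List<CrawledPage> firstRun = new ArrayList<>();
        firstRun.add(new CrawledPage("http://example.org/", "old text"));
        service.publish(firstRun);

        List<CrawledPage> secondRun = new ArrayList<>();
        secondRun.add(new CrawledPage("http://example.org/", "new text"));
        service.publish(secondRun);

        System.out.println(indexer.size());                         // 1
        System.out.println(indexer.textFor("http://example.org/")); // new text
    }
}
```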

Future:
I am committed to pursuing the development of the crawler, and I hope many current and future developers will join me. With your consent, I would likely move this project to SourceForge, but all opinions are welcome.

David
