Re: Avalonized WebCrawler

Otis Gospodnetic Mon, 27 Jan 2003 20:05:27 -0800

Oh, no need to swallow any pride - some of us have been meaning to do
this... when we have more time... hah.
So just a big thank you from us!


Otis


--- Paul Hammant <[EMAIL PROTECTED]> wrote:
> David,
> 
> Great work.  I sure hope the Lucene peeps can swallow (a little) pride
> and merge the best bits.  It is always difficult receiving a mountain
> of changes...
> 
> I look forward to using some of the components outside Lucene, and the
> whole thing inside Phoenix when you have it ready :-)))
> 
> - Paul H
> (hammant@apache)
> 
> >
> > Lucene developers,
> >
> > This mail follows up on a few threads that took place 2-3 months ago
> > on both the Lucene and Avalon lists:
> >
> > http://marc.theaimsgroup.com/?l=lucene-dev&m=101518595918785&w=2
> > http://marc.theaimsgroup.com/?l=avalon-users&m=103706452017829&w=2
> >
> > They were about porting the WebCrawler app to a component-based
> > application using Avalon. Over the past few days I did just that,
> > and I will be happy to share the code with the community. There is
> > still a lot to do, but my goal was to contact you once the code
> > reached a level of development similar to the one in CVS. I did not
> > contact the list earlier because I wasn't sure where I was going :),
> > and because I do not have CVS access at Apache.
> >
> > You can download the code @ http://67.116.155.180/~wdavidw/crawler.zip
> >
> > Both the sources and binaries are present. In my local environment,
> > I use Maven as the build system. It isn't included in the download
> > because some of the jars I used are recent CVS snapshots not present
> > in the Maven remote repository (ibiblio.org). If I am not mistaken,
> > all the required libraries are present in the zip file.
> >
> > Overall, the code behaves just like the present crawler hosted in
> > the Lucene Sandbox repository. Since I mostly refactored this code
> > base, it will be quite easy for the developer(s) to find out what
> > happens. All the comments, methods, etc. remain the same; I only
> > changed the most relevant parts. You will find the code divided into
> > 2 packages, the original package "de.lanlab.*" and the new one
> > "org.crawl.*". The reason behind this separation is that every time
> > I created a new component, I moved its code into the second package
> > for clarity.
> >
> > As the Avalon container, I chose Fortress. It is a stable and
> > almost released container (a matter of weeks). I am seriously
> > thinking about Merlin, but it is not a priority for now.
> >
> > Here is a list of the created components/services:
> >
> > fetcher-task-factory
> > host-manager
> > host-resolver
> > url-message-factory
> > web-document-factory
> > message-handler
> > message-listener-selector
> >  . url-length-stage
> >  . url-scope-stage
> >  . robot-exclusion-stage
> >  . url-visited-stage
> >  . known-path-stage
> >  . fetcher-stage
> > storage-pipeline
> > thread-monitor
> > fetcher-thread-factory
> > server-thread-factory
> > url-normalizer
> > url-visited-manager
> > one more to appear: thread-pool-manager
> >
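
A minimal sketch of how the stages listed under message-listener-selector might chain together, in plain Java with no Avalon dependency. The names UrlStage, UrlLengthStage, UrlScopeStage, UrlVisitedStage and UrlPipeline, and the accept/reject contract, are assumptions made for illustration; they are not taken from the org.crawl.* code in the zip.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** One step of the URL pipeline: true passes the URL on, false drops it. */
interface UrlStage {
    boolean accept(URI url);
}

/** Drops URLs longer than a given limit (cf. url-length-stage). */
final class UrlLengthStage implements UrlStage {
    private final int maxLength;
    UrlLengthStage(int maxLength) { this.maxLength = maxLength; }
    public boolean accept(URI url) { return url.toString().length() <= maxLength; }
}

/** Keeps only URLs whose host ends with a given suffix (cf. url-scope-stage). */
final class UrlScopeStage implements UrlStage {
    private final String hostSuffix;
    UrlScopeStage(String hostSuffix) { this.hostSuffix = hostSuffix; }
    public boolean accept(URI url) {
        String host = url.getHost();
        return host != null && host.endsWith(hostSuffix);
    }
}

/** Drops URLs that were already seen (cf. url-visited-stage). */
final class UrlVisitedStage implements UrlStage {
    private final Set<String> visited = new HashSet<>();
    public boolean accept(URI url) { return visited.add(url.toString()); }
}

/** Runs the stages in order and stops at the first rejection. */
final class UrlPipeline {
    private final List<UrlStage> stages;
    UrlPipeline(List<UrlStage> stages) { this.stages = stages; }
    boolean accept(URI url) {
        for (UrlStage stage : stages) {
            if (!stage.accept(url)) {
                return false;
            }
        }
        return true;
    }
}
```

Putting the visited check last means a URL rejected for length or scope is never recorded as visited. In a container, each stage would be a separate component and the order would come from configuration rather than from a hand-built list.
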
> > Configuration:
> > At this time, every config property is hard-coded in the component
> > class. It will be a fast and easy task to integrate the config file
> > because the components already implement the Avalon configuration
> > lifecycle.
> >
> > Logging:
> > I had a hard time using the Fortress logging service. For now, only
> > two loggers are working, one for the Fortress system, the other for
> > the crawler. Once I understand where the logging issue is coming
> > from, each component could have its own logger without any code
> > changes.
> >
> > Integration:
> > Fortress can easily be plugged into any type of environment or run as
> > a standalone application. I am planning to write a Phoenix block
> > soon.
> >
> > Client connection:
> > The current Observer service will change completely. Instead of
> > printing information to the console, it will export some sort of
> > application state descriptor object via AltRMI, or anything else.
> > It will be up to the client to render that information.
> >
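
One way to picture the "application state descriptor object" is as a small serializable snapshot that the client renders however it likes. The sketch below is an assumption-laden illustration: CrawlerState, its fields, and CrawlerStateCodec are hypothetical names, and plain Java serialization stands in for whatever transport (AltRMI or otherwise) ends up being used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Immutable snapshot of crawler progress; the fields are illustrative only. */
final class CrawlerState implements Serializable {
    private static final long serialVersionUID = 1L;
    final int urlsQueued;
    final int docsFetched;

    CrawlerState(int urlsQueued, int docsFetched) {
        this.urlsQueued = urlsQueued;
        this.docsFetched = docsFetched;
    }

    /** One possible client-side rendering; other clients may format it differently. */
    String render() {
        return "queued=" + urlsQueued + " fetched=" + docsFetched;
    }
}

/** Round-trips a snapshot through Java serialization, as a remoting layer might. */
final class CrawlerStateCodec {
    static byte[] encode(CrawlerState state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static CrawlerState decode(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (CrawlerState) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the snapshot immutable means the crawler can hand it out without worrying about a client changing crawler state through it.
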
> > Speed:
> > When running the current code against the Avalonized one, I get very
> > similar speed results. The only difference is that it takes somewhat
> > longer for the new one to reach a stable speed (about 15 seconds).
> >
> > Avalon:
> > I kept my use of Avalon simple. For now, I didn't want to use all
> > the tools available. There are a few areas where Avalon could
> > provide more functionality:
> > - the lifestyle handler (both in Fortress and Merlin), which could
> > replace the usage of factories, for example.
> > - the thread library, which I left out because I didn't want to
> > change any of the current code.
> > - the event library, which would reinforce a SEDA architecture.
> >
> > Javadocs:
> > None; I kept the ones already present. I will describe every
> > service in more detail soon, when I finish all the refactoring.
> >
> > Lucene:
> > I think Lucene should be separated from the crawler. One could
> > easily write a service which schedules the crawling process and
> > exports the results. Then, this service could use those results to
> > create/update a Lucene index.
> >
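
The separation described here could look roughly like the following: the crawler publishes results to an interface, and indexing is just one implementation of it. This is only a sketch. CrawlResult, CrawlResultSink, CrawlExportService and RecordingSink are hypothetical names, scheduling is left out, and no Lucene calls are shown; a real indexing sink would add a document to a Lucene index where RecordingSink records the URL.

```java
import java.util.ArrayList;
import java.util.List;

/** A fetched page as the crawler might export it; the fields are illustrative. */
final class CrawlResult {
    final String url;
    final String content;

    CrawlResult(String url, String content) {
        this.url = url;
        this.content = content;
    }
}

/** Anything that consumes crawl output: an indexer, a file dump, a test double. */
interface CrawlResultSink {
    void onResult(CrawlResult result);
}

/** Crawler-side service: it knows about sinks, not about Lucene. */
final class CrawlExportService {
    private final List<CrawlResultSink> sinks = new ArrayList<>();

    void addSink(CrawlResultSink sink) {
        sinks.add(sink);
    }

    /** Would be called by the crawler for every stored document. */
    void publish(CrawlResult result) {
        for (CrawlResultSink sink : sinks) {
            sink.onResult(result);
        }
    }
}

/** Stand-in for an indexing sink; it only remembers which URLs it received. */
final class RecordingSink implements CrawlResultSink {
    final List<String> indexedUrls = new ArrayList<>();

    public void onResult(CrawlResult result) {
        indexedUrls.add(result.url);
    }
}
```

With this split, the crawler jar would not need Lucene on its classpath at all; only the sink that builds the index would.
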
> > Future:
> > I am committed to pursuing the development of the crawler. I hope
> > many current and future developers will follow me. With your
> > consent, I would likely move this project to SourceForge, but all
> > opinions are welcome.
> >
> > David
> >
> >
> > -- 
> > To unsubscribe, e-mail:   
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail: 
> > <mailto:[EMAIL PROTECTED]>
> >
> >
> >

