On Sat, 2002-03-02 at 19:10, Halácsy Péter wrote:
>
> > -----Original Message-----
> > From: Andrew C. Oliver [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, February 26, 2002 2:13 PM
> > To: Lucene Developers List
> > Subject: Re: Proposal for Lucene / new component
> >
> > Humm.  Well said.  I'm not against using Avalon.  My approach to
> > software is this though: Get a working draft.  Refactor it into
> > something that will *stand the test of time* for your second or third
> > release.  Things change...iterate.  Not against a super configurable
> > masterpiece...but first I want to crawl and index web pages over httpd
> > in various pluggable mime formats..  Once we get there...
>
> Hello,
> I was abroad last week and it took at least 30 min to read the
> discussion about Avalon. It's great!
>
> Someone mentioned that Avalon is only used by Cocoon. Well, we are using
> Cocoon and I'm very happy that it is Avalon based. I think that is the
> main reason for its flexibility. BTW Cocoon uses Lucene, pls refer to
> http://xml.apache.org/cocoon/userdocs/generators/search-generator.html
>
> I think if you need logging, configuring, threading, pooling (for the
> crawler) and want to be component based, you need a framework, something
> like Avalon. It took one day to understand Avalon and write the first
> Hello world application, but you can save a lot of time while coding.
>
Great!  Can you post the work you did to get the Hello Avalon app running
somewhere?  If you could document along those lines as well then I'll be
happy to go and write a "getting started" guide for Avalon.  I'm not
objecting to using Avalon provided I can actually understand it.  I'm
really close thanks to the fine work of Ken Barozzi
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/poi/cocoon-poi/), but I'm
one step away from actually being able to start using Avalon.  It's not an
"I won't" issue, it's an "I can't" issue.

> Iteration is very good practice in software development and can be
> applied to Avalon based applications as well. First you should only write
> interfaces. At first you can implement a fake component that works like a
> real one. After a while you can swap in the working component by
> rewriting the config file.

I kinda believe in writing components that work or do something useful
early on.

> For example I think the http crawler is built from more than one
> component:
>
> 1. the fetcher that connects to the webserver and gets the page from the
>    url
>    responsible for: downloading the page as is (handling network errors),
>    handling HTTP status codes (for example redirects)
>    configurable by: proxy server, max open sockets
>
> 2. a component that parses the fetched page and extracts relevant
>    metadata
>
> 3. a component that is an interface to the loader; it gets the fetched
>    and parsed pages from the parser (or gets a command from the fetcher
>    to delete pages from the search database)
>    this interface can be implemented in several components:
>    - one that puts the data in files (if the loader and the search db are
>      on another box)
>    - one that gives the data to the loader component (that is in the same
>      JVM)
>    - and so on
>
> 4. one that feeds urls to the crawler's database
>    responsible for:
>    - extracting links from the downloaded pages
>    - handling manually submitted urls (submitted by users or sysadmins)
>    - filtering out the excluded urls
>    configurable by: excluding rules

awesome, can you patch the proposal with how you propose to do that?

> 5. one that reads urls from the database and feeds them to the fetcher
>    the most sophisticated component, responsible for:
>    choosing the right url to crawl:
>    - it can use a priority list based on url patterns
>    - do not fetch a lot of pages from the same server (max 1 request/min)
>    - robots.txt file
>    configurable by: priority lists, max urls from a host
>
> 6. and the last component is the database itself; it can be a JDBC
>    compliant database or something file system based
>    responsible for:
>    - adding/deleting urls to/from the database (url: last fetched date,
>      last HTTP status code, last action [add or delete])
>    - answering host related questions: how many urls were fetched from
>      the host, what time was the last url fetched, robots.txt of the host
>
> I know it's not a model of a working http crawler but please notice:
> 1. using Avalon you can change the implementation of a component in 30
>    seconds (if someone implemented it ;)
> 2. you don't have to work on implementing logging, a configuration
>    system, or database pooling for JDBC
> 3. the crawler is a component that needs no information about the search
>    database (and the loader/indexer doesn't know the crawler)
> 4. the parser and loader interface component can be used in a file based
>    HTML crawler (that reads static HTML pages from the directory of the
>    webserver [if the engine is used in an intranet])
> 5. having different loader components you can build a search engine for a
>    single JVM or for a distributed system (and you do not need to
>    implement that in the first iteration cycle)
>
> OK, this mail is already too long and I'm tired.
>
> peter

Cool, my only problems are:

1. if I'm to participate in development that uses Avalon, I must
   understand Avalon;
2. some folks have already written/donated some tremendous code that does
   some of these things.  I'd like to reuse this code -- I'm happy to help
   refactor it to Avalon...but it goes back to #1.

Anyhow, maybe I'm just not skilled enough to grasp Avalon (I've thought it
was just a poor-documentation issue).  If that prevents me from
contributing to this effort in a meaningful way then no big deal.  My goal
is to help facilitate the work in any way I can.  If that means Avalon,
fine, but up until now I've mostly failed to get it.  If you're able, then
how about getting us started with some Avalon-esque interfaces?

Thanks,

-Andy

>
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
>
--
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html
                        - fix java generics!

The avalanche has already started. It is too late for the pebbles to vote.
-Ambassador Kosh

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
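
A rough sketch of the "interfaces first, fake component, swap by config"
idea from the thread, in plain Java.  This is not Avalon code and uses no
Avalon API: Fetcher, LinkExtractor, FakeFetcher, SimpleLinkExtractor and
ComponentRegistry are hypothetical names invented for illustration, and
the registry is only a stand-in for the config-file driven lookup that a
component framework would provide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Entry point: picks the fetcher implementation by name, as a config file might.
class CrawlerSketch {
    public static void main(String[] args) {
        ComponentRegistry registry = new ComponentRegistry();
        registry.register("fake", FakeFetcher::new);

        String configured = "fake"; // assumption: would be read from a config file
        Fetcher fetcher = registry.lookup(configured);
        LinkExtractor extractor = new SimpleLinkExtractor();

        String page = fetcher.fetch("http://example.org/");
        System.out.println(extractor.extractLinks(page));
    }
}

// Hypothetical role for component 1: download a page "as is".
interface Fetcher {
    String fetch(String url);
}

// Hypothetical role for part of component 4: pull links out of a fetched page.
interface LinkExtractor {
    List<String> extractLinks(String page);
}

// Fake component: serves a canned page instead of touching the network.
class FakeFetcher implements Fetcher {
    private final Map<String, String> pages = new HashMap<>();

    FakeFetcher() {
        pages.put("http://example.org/",
                  "<a href=\"http://example.org/a\">a</a>");
    }

    @Override
    public String fetch(String url) {
        // Unknown urls yield an empty page rather than an error.
        return pages.getOrDefault(url, "");
    }
}

// Naive extractor: collects every double-quoted href value, nothing more.
class SimpleLinkExtractor implements LinkExtractor {
    @Override
    public List<String> extractLinks(String page) {
        List<String> links = new ArrayList<>();
        String marker = "href=\"";
        int from = 0;
        while (true) {
            int start = page.indexOf(marker, from);
            if (start < 0) break;
            start += marker.length();
            int end = page.indexOf('"', start);
            if (end < 0) break;
            links.add(page.substring(start, end));
            from = end + 1;
        }
        return links;
    }
}

// Stand-in for a framework's component lookup: maps a name to a factory.
class ComponentRegistry {
    private final Map<String, Supplier<Fetcher>> fetchers = new HashMap<>();

    void register(String name, Supplier<Fetcher> factory) {
        fetchers.put(name, factory);
    }

    Fetcher lookup(String name) {
        Supplier<Fetcher> factory = fetchers.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No fetcher named " + name);
        }
        return factory.get();
    }
}
```

Once a real network-backed Fetcher exists, registering it under another
name and changing the configured name would be the only edit needed; the
code that calls fetch() and extractLinks() stays the same.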
