On Sat, 2002-03-02 at 19:10, Halácsy Péter wrote:
> 
> > -----Original Message-----
> > From: Andrew C. Oliver [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, February 26, 2002 2:13 PM
> > To: Lucene Developers List
> > Subject: Re: Proposal for Lucene / new component
> > 
> > 
> > Humm.  Well said.  I'm not against using Avalon.  My approach to
> > software is this though:  Get a working draft.  Refactor it into something
> > that will *stand the test of time* for your second or third release.  Things
> > change...iterate.  Not against a super configurable masterpiece...but
> > first I want to crawl and index web pages over httpd in various
> > pluggable mime formats.. Once we get there...
> > 
> 
> Hello,
> I was abroad last week, and it took at least 30 minutes to read the discussion about
> Avalon. It's great!
> 
> Someone mentioned that Avalon is only used by Cocoon. Well, we are using Cocoon, and
> I'm very happy that it is Avalon-based. I think that is the main reason for its
> flexibility. BTW, Cocoon uses Lucene; please refer to
> http://xml.apache.org/cocoon/userdocs/generators/search-generator.html
> 
> I think if you need logging, configuration, threading, and pooling (for the crawler)
> and want to be component-based, you need a framework something like Avalon. It took
> one day to understand Avalon and write the first Hello World application, but you can
> save a lot of time while coding.
> 

Great!  Can you post your Hello Avalon app somewhere?
If you could document along those lines as well then I'll be happy to go
and write a "getting started" guide for Avalon.  

I'm not objecting to using Avalon provided I can actually understand
it.  I'm really close thanks to the fine work of Ken Barozzi
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/poi/cocoon-poi/), but
I'm one step away from actually being able to start using Avalon.  It's
not an "I won't", it's an "I can't" issue.


> Iteration is very good practice in software development and can be applied to an
> Avalon-based application as well. First you write only the interfaces. In the first
> iteration you can implement a fake component that works like a real one. After a while
> you can swap in the working component by editing the config file.
> 
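A minimal sketch of that "fake first, swap later" idea, in plain Java. It
deliberately avoids the Avalon API: the names PageFetcher, FakePageFetcher,
HttpPageFetcher and FetcherFactory are made up for illustration, and a
properties-style string stands in for the framework's config file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical component contract; not an Avalon or Lucene interface.
interface PageFetcher {
    String fetch(String url);
}

// Fake implementation for the first iteration: canned HTML, no network access.
class FakePageFetcher implements PageFetcher {
    public String fetch(String url) {
        return "<html><body>fake page for " + url + "</body></html>";
    }
}

// Placeholder for the real implementation that would arrive in a later iteration.
class HttpPageFetcher implements PageFetcher {
    public String fetch(String url) {
        throw new UnsupportedOperationException("not implemented yet");
    }
}

class FetcherFactory {
    // Known implementations, keyed by the name used in the config text.
    private static final Map<String, Supplier<PageFetcher>> IMPLEMENTATIONS = Map.of(
        "fake", FakePageFetcher::new,
        "http", HttpPageFetcher::new);

    // Reads the "fetcher" key from properties-style text; defaults to the fake.
    static PageFetcher fromConfig(String configText) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String name = config.getProperty("fetcher", "fake");
        Supplier<PageFetcher> supplier = IMPLEMENTATIONS.get(name);
        if (supplier == null) {
            throw new IllegalArgumentException("unknown fetcher: " + name);
        }
        return supplier.get();
    }
}
```

Callers only ever see PageFetcher, so moving from "fetcher=fake" to
"fetcher=http" in the config text changes the implementation without touching
the calling code.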

I kinda believe in  writing components that work or do something useful
early on.  

> For example I think the http crawler is built from more than one component:
> 1. the fetcher that connects to the web server and gets the page from the URL
> responsible for: downloading the page as is (handling network errors), handling HTTP
> status codes (for example redirects)
> 
> configurable by: proxy server, max open sockets
> 
> 2. a component that parses the fetched page and extracts relevant metadata
> 
> 3. a component that is an interface to the loader; it gets the fetched and parsed
> pages from the parser (or gets a command from the fetcher to delete pages from the
> search database)
> 
> this interface can be implemented in several components:
> one that puts the data in files (if the loader and the search db are on another box)
> one that gives the data to the loader component (that is in the same JVM)
> and so on
>  
> 4. one that feeds URLs to the crawler's database
> responsible for:
> extracting links from the downloaded pages
> handling manually submitted URLs (submitted by users or sysadmins)
> filtering out the excluded URLs
> 
> configurable by: exclusion rules
> 
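Components 1 to 4 above could be sketched roughly as follows. Every type and
method name here (FetchedPage, Fetcher, PageParser, LoaderSink, UrlFeeder) is
hypothetical, nothing comes from Avalon, Lucene, or the proposal, and the
regex-based link extraction is only a stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; none of these types exist in Avalon or Lucene.

// Plain data holder for a downloaded page.
final class FetchedPage {
    final String url;
    final int httpStatus;
    final String body;

    FetchedPage(String url, int httpStatus, String body) {
        this.url = url;
        this.httpStatus = httpStatus;
        this.body = body;
    }
}

// 1. Downloads the page as is; proxy and socket limits would be configuration.
interface Fetcher {
    FetchedPage fetch(String url);
}

// 2. Extracts metadata from a fetched page (reduced here to a title string).
interface PageParser {
    String parseTitle(FetchedPage page);
}

// 3. Interface to the loader; file-based and in-JVM implementations both fit.
interface LoaderSink {
    void add(FetchedPage page, String title);
    void delete(String url);
}

// 4. Feeds URLs to the crawler's database, applying the exclusion rules.
final class UrlFeeder {
    // Naive href matcher, for illustration only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private final List<Pattern> exclusionRules;

    UrlFeeder(List<Pattern> exclusionRules) {
        this.exclusionRules = exclusionRules;
    }

    // Returns submitted URLs followed by extracted links, minus excluded ones.
    List<String> collect(FetchedPage page, List<String> submittedUrls) {
        List<String> candidates = new ArrayList<>(submittedUrls);
        Matcher m = HREF.matcher(page.body);
        while (m.find()) {
            candidates.add(m.group(1));
        }
        List<String> accepted = new ArrayList<>();
        for (String url : candidates) {
            if (!isExcluded(url)) {
                accepted.add(url);
            }
        }
        return accepted;
    }

    // A URL is excluded when any rule matches it in full.
    private boolean isExcluded(String url) {
        for (Pattern rule : exclusionRules) {
            if (rule.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Only UrlFeeder carries logic here; the three interfaces are just the seams
where a fake or a real implementation could be plugged in.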

awesome, can you patch the proposal with how you propose to do that?

> 5. one that reads URLs from the database and feeds them to the fetcher
> the most sophisticated component, responsible for:
> choosing the right URL to crawl:
>  - it can use a priority list based on URL patterns
>  - it does not fetch a lot of pages from the same server (max 1 request/min)
>  - it respects the robots.txt file
> configurable by: priority lists, max URLs from a host
> 
> 6. and the last component is the database itself; it can be a JDBC-compliant
> database or something file-system based
> responsible for: adding/deleting URLs to/from the database (per URL: last fetched
> date, last HTTP status code, last action [add or delete])
> answering host-related questions: how many URLs were fetched from the host, what time
> the last URL was fetched, the host's robots.txt
> 
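The "max 1 request/min per server" rule from component 5 might look something
like the sketch below. PoliteScheduler is a made-up name, the queue and the
per-host timestamps live in memory instead of in the URL database, and
priority lists and robots.txt handling are left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical scheduler: hands out queued URLs while rate-limiting each host.
class PoliteScheduler {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<String, Long> lastFetchByHost = new HashMap<>();
    private final long minIntervalMillis;

    PoliteScheduler(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    void enqueue(String url) {
        queue.addLast(url);
    }

    // Returns the first queued URL whose host is not rate-limited at nowMillis,
    // or null when every queued URL has to wait.
    String next(long nowMillis) {
        Iterator<String> it = queue.iterator();
        while (it.hasNext()) {
            String url = it.next();
            String host = URI.create(url).getHost();
            Long last = lastFetchByHost.get(host);
            if (last == null || nowMillis - last >= minIntervalMillis) {
                it.remove();
                lastFetchByHost.put(host, nowMillis);
                return url;
            }
        }
        return null;
    }
}
```

Passing the clock in as a parameter keeps the sketch easy to test; with a
60000 ms interval, two URLs on the same host are handed out at least a minute
apart while URLs on other hosts go through in between.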
> I know it's not a model of a working HTTP crawler, but please notice:
> 1. using Avalon you can change the implementation of a component in 30 seconds (if
> someone has implemented it ;)
> 2. you don't have to work on implementing logging, a configuration system, or database
> pooling for JDBC
> 3. the crawler is a component that needs no information about the search database
> (and the loader/indexer doesn't know the crawler)
> 4. the parser and loader interface components can be used in a file-based HTML crawler
> (one that reads static HTML pages from the web server's directory [if the engine
> is used on an intranet])
> 5. having different loader components, you can build a search engine for a single JVM
> or for a distributed system (and you do not need to implement that in the first
> iteration cycle)
> 
> OK, this mail is already too long and I'm tired.
> 
> peter

Cool.  My only problems are: (1) if I'm to participate in development
involving Avalon, I must understand Avalon; (2) some folks have already
written/donated some tremendous code that does some of these things.
I'd like to reuse this code -- I'm happy to help refactor it to
Avalon...but it goes back to #1.

Anyhow, maybe I'm just not skilled enough to grasp Avalon (I've thought
it was just a poor-documentation issue).  If that prevents me from
contributing to this effort in a meaningful way then no big deal.  My
goal is to help facilitate the work in any way I can.  If that means
Avalon, fine, but up until now I've mostly failed to get it.  If you're
able then how about getting us started with some Avalon-esque
interfaces?

Thanks,

-Andy

> 
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> 
-- 
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document 
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html 
                        - fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

