Hi,
I just want to keep you informed about how we plan to integrate the LARM crawler with Lucene. I'm working with Mehran Mehr on two major topics:

1. Lucene storage. We want to see a web document as a bunch of name-value pairs, one of which is the URL while another could be the document content itself. From within the storage pipeline, these web documents can be enhanced or changed. At the end sits the Lucene storage, which takes a web document and stores its contents as fields within a Lucene index. So the storage itself is stupid. We can think of a lot of preprocessing steps that could occur before the store process itself takes place: document conversion, HTML removal, header extraction, lemmatization and other linguistic processing, and so forth. The storage can also be just an intermediate step: web documents could equally be saved to plain files or published to a JMS topic, allowing the processing steps to be divided temporally or spatially. (A rough sketch of what I mean follows at the end of this mail.)

2. Configuration. The crawler is very modular and mainly consists of several producer/consumer pipelines that define where documents come from and how they are processed. We want this whole pipeline to be configurable (remember, most of it is still wired up in the source code). That way we can provide different configurations for different purposes: one could mimic the behavior of "wget", for example; another could build a fast one-machine crawler for a medium-sized intranet; a third could be distributed and crawl a major part of the web. (A sketch of such a configuration also follows below.)

As soon as we have done these two things, I think we can move the crawler and Lucene a bit closer together.

We are still looking for people to help us. If you have resources left for further development (design, code, tests), please read the technical overview document and the TODO.txt files in the lucene-sandbox repository, and contact me.

--Clemens
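To make the first topic more concrete, here is a minimal sketch of what I have in mind. The class names (WebDocument, LuceneStorage) and field names are placeholders I made up for this mail, not the actual sandbox code; the sketch assumes the current Lucene field API (Field.Keyword / Field.Text):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Placeholder class: a web document is nothing more than a bag
    // of name-value pairs.
    class WebDocument {
        private Map fields = new HashMap();

        void setField(String name, String value) { fields.put(name, value); }
        String getField(String name) { return (String) fields.get(name); }
        Iterator getFieldNames()     { return fields.keySet().iterator(); }
    }

    // The "stupid" storage: it just copies every name-value pair into
    // a field of a Lucene Document and adds it to the index.
    class LuceneStorage {
        private IndexWriter writer;

        LuceneStorage(String indexPath) throws IOException {
            writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        }

        void store(WebDocument doc) throws IOException {
            Document luceneDoc = new Document();
            // the URL identifies the document: stored, but not tokenized
            luceneDoc.add(Field.Keyword("url", doc.getField("url")));
            for (Iterator i = doc.getFieldNames(); i.hasNext(); ) {
                String name = (String) i.next();
                if (!"url".equals(name)) {
                    luceneDoc.add(Field.Text(name, doc.getField(name)));
                }
            }
            writer.addDocument(luceneDoc);
        }

        void close() throws IOException { writer.close(); }
    }

A preprocessing step (HTML removal, header extraction, and so on) would then simply be something that takes a WebDocument and returns a modified one, so the chain stays composable and the storage at the end never has to know what happened before.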
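For the second topic, imagine the pipeline being declared in a properties file instead of in the source. All keys and processor class names below are invented; they only illustrate the idea:

    # crawler.properties (hypothetical -- keys and class names invented)
    crawler.startUrls   = http://jakarta.apache.org/
    crawler.maxDepth    = 3
    pipeline.processors = HtmlParserStep, LinkExtractorStep, LuceneStorage
    storage.indexPath   = /tmp/crawl-index

The crawler could then assemble the pipeline by reflection, roughly like this:

    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import java.util.StringTokenizer;

    // Hypothetical factory: builds the document pipeline from the
    // config file instead of hard-wiring it in the source.
    class PipelineFactory {
        static List buildPipeline(String configFile) throws Exception {
            Properties props = new Properties();
            props.load(new FileInputStream(configFile));
            List pipeline = new ArrayList();
            StringTokenizer t = new StringTokenizer(
                    props.getProperty("pipeline.processors"), ", ");
            while (t.hasMoreTokens()) {
                // every processor is instantiated by class name
                pipeline.add(Class.forName(t.nextToken()).newInstance());
            }
            return pipeline;
        }
    }

A "wget"-like setup, a one-machine intranet crawler, and a distributed crawler would then differ only in their property files, not in the code.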
