I see nice progress here. I will try it in the near future (time!).

> I have added an experimental version of a LuceneStorage to the LARM
> crawler, available from CVS in lucene-sandbox. That means crawled
> documents can now directly be indexed into a lucene index.
>
> Sorry, no configuration files yet. Config is done in
> ...larm/FetcherMain.java
> The main class FetcherMain is now configured to store the contents in
> a lucene index called "luceneIndex".
>
> Lots of open questions:
> - LARM doesn't have the notion of closing everything down. What
>   happens if IndexWriter is interrupted?

As in, what if it encounters an exception (e.g. somebody removes the
index directory)? I guess one of the items that should then get added
to the to-do list is checkpointing, for starters.

> - I haven't tried to read from the index yet...

Heh, I'm familiar with that situation.

> - How to configure the stuff from a config file
> ... (it's late)

A property file with name=value pairs and some init() method that is
called at the beginning may be sufficient.

> Please try it:
>
> To build and run it,
> - put ANT in your path
> - provide a build.properties with the location of the lucene Jar file
>   (lucene.jar=)
>   (just like javacc in lucene/build.xml)
> - put HTTPClient.jar from http://innovation.ch/java and jakarta-oro
>   library into libs
> - type:
>
>   ant run -Dstart=<starturl> -Drestrictto=<restricttourl>
>   -Dthreads=<numThreads>
>
> ex.:
>   ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.*
>   -Dthreads=5
>
> note: restrictto is a regular expression; the URLs tested against it
> are normalized beforehand, which means they are made lower case,
> index.* are removed, and some other corrections
> (see URLNormalizer.java for details)

Removing index.* may be too bold and incorrect in some situations.

> note: LuceneStorage is dumb; it just takes the WebDocument and stores
> it. That means with the current config it also stores tags, and only
> one "content" field that contains everything. I plan to write another
> storage that uses the HTMLDocument from the demo package to store HTML
> documents.

Nice. I found NekoHTML to do a nice job of 'dehtmlization'.

> Please note that when adding this storage to the storage pipeline,
> the whole crawling process becomes CPU- instead of I/O bound. We
> already have plans how to do the distribution.
>
> Feel free to contact me if there are questions.
> Still Looking For Contributors!
>
> Clemens

Ausgezeichnet!

Otis
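
P.S. To make the property-file idea a bit more concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption on my part: the class name CrawlerConfig, the init() signature, and the indexDir key are hypothetical and not part of LARM. The start, restrictto and threads keys simply mirror the -D options from the ant command line, and "luceneIndex" is the index name currently set in FetcherMain.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical settings holder; not an existing LARM class.
final class CrawlerConfig {
    final String startUrl;
    final String restrictTo;
    final int threads;
    final String indexDir;

    private CrawlerConfig(String startUrl, String restrictTo, int threads, String indexDir) {
        this.startUrl = startUrl;
        this.restrictTo = restrictTo;
        this.threads = threads;
        this.indexDir = indexDir;
    }

    // Reads name=value pairs once at startup and fails fast on bad input.
    static CrawlerConfig init(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);

        String start = require(props, "start");
        // Assumed defaults: crawl everything, one thread, index name as in FetcherMain.
        String restrictTo = props.getProperty("restrictto", ".*").trim();
        int threads = Integer.parseInt(props.getProperty("threads", "1").trim());
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        String indexDir = props.getProperty("indexDir", "luceneIndex").trim();
        return new CrawlerConfig(start, restrictTo, threads, indexDir);
    }

    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing required property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a real property file.
        String file = "start=http://localhost/\nrestrictto=http://localhost.*\nthreads=5\n";
        CrawlerConfig config = init(new StringReader(file));
        System.out.println(config.startUrl + " with " + config.threads
                + " threads, index " + config.indexDir);
    }
}
```

FetcherMain could call something like init() first and pass the values on, instead of having them hard-coded. Taking a Reader rather than a file name keeps the sketch easy to try without touching the disk.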

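
P.P.S. On the index.* point, a small sketch of why I think it may be too bold. This is only my approximation of the two steps described above (lower-casing and removing index.*); I have not checked it against URLNormalizer.java, so the class name, the pattern, and the choice of a full-match test against restrictto are all assumptions.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical approximation; the real rules live in URLNormalizer.java.
final class NormalizeSketch {
    // Matches a trailing "/index.<anything>" path segment.
    private static final Pattern INDEX_FILE = Pattern.compile("/index\\.[^/?#]*$");

    // Lower-cases the URL and drops a trailing index.* file name.
    static String normalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return INDEX_FILE.matcher(lower).replaceFirst("/");
    }

    // Assumes restrictto must match the whole normalized URL.
    static boolean allowed(String url, String restrictTo) {
        return Pattern.matches(restrictTo, normalize(url));
    }

    public static void main(String[] args) {
        String a = normalize("http://localhost/Docs/index.html");
        String b = normalize("http://localhost/docs/index.php");
        // Two different resources end up with the same normalized URL.
        System.out.println(a.equals(b));
        System.out.println(allowed("http://localhost/Docs/index.html", "http://localhost.*"));
    }
}
```

Under these assumptions index.html and index.php in the same directory collapse into one URL, so a server that serves different content for each (or whose directory default is something else entirely) would have pages skipped or fetched under the wrong address.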