Re: LARM Web Crawler: LuceneStorage [experimental]

Otis Gospodnetic Tue, 18 Jun 2002 14:36:00 -0700

I see nice progress here.
I will try it in the near future (time!).

> I have added an experimental version of a LuceneStorage to the LARM
> crawler, available from CVS in lucene-sandbox. That means crawled
> documents can now be indexed directly into a Lucene index.
> 
> 
> 
> Sorry, no configuration files yet. Config is done in
> ...larm/FetcherMain.java
> The main class FetcherMain is now configured to store the contents in
> a lucene index called "luceneIndex".
> 
> 
> Lots of open questions:
> - LARM doesn't have the notion of closing everything down. What
> happens if IndexWriter is interrupted?


As in, what if it encounters an exception (e.g. somebody removes the
index directory)?  I guess one item that should then get added to the
to-do list is checkpointing, for starters.
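
To make the checkpointing idea a bit more concrete, here is a minimal
sketch. It assumes a hypothetical IndexSink interface standing in for
whatever ends up wrapping the IndexWriter; neither IndexSink nor
CheckpointingStorage exists in LARM or Lucene, and what "checkpoint"
would actually do against a Lucene index is left open. The only real
API used is the JDK's Runtime.addShutdownHook.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical stand-in for the index writer wrapper; not a LARM or Lucene type. */
interface IndexSink extends Closeable {
    void add(String document) throws IOException;
    void checkpoint() throws IOException; // make everything added so far durable
}

/** Checkpoints every N documents and closes the sink exactly once. */
class CheckpointingStorage implements Closeable {
    private final IndexSink sink;
    private final int interval;
    private int sinceCheckpoint = 0;
    private boolean closed = false;

    CheckpointingStorage(IndexSink sink, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.sink = sink;
        this.interval = interval;
    }

    /** Registers a JVM shutdown hook so a normal exit or Ctrl-C still closes the sink. */
    void closeOnShutdown() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }));
    }

    /** Adds one document and checkpoints once 'interval' documents have accumulated. */
    synchronized void store(String document) throws IOException {
        if (closed) {
            throw new IOException("storage already closed");
        }
        sink.add(document);
        if (++sinceCheckpoint >= interval) {
            sink.checkpoint();
            sinceCheckpoint = 0;
        }
    }

    /** Final checkpoint plus close; safe to call more than once. */
    @Override
    public synchronized void close() throws IOException {
        if (!closed) {
            closed = true;
            sink.checkpoint();
            sink.close();
        }
    }
}
```

A shutdown hook does not help if the JVM is killed hard or the machine
loses power, which is why the periodic checkpoint is there as well: at
most 'interval' documents would be lost in that case.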

> - I haven't tried to read from the index yet...

Heh, I'm familiar with that situation.

> - How to configure the stuff from a config file
> ... (it's late)

A properties file with name=value pairs and some init() method that is
called at startup may be sufficient.
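
Roughly what I have in mind, as a sketch only. The key names start,
restrictto and threads simply mirror the -D options from your message;
the "index" key, the CrawlerConfig class and the default values are my
assumptions, not anything that exists in LARM today.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Hypothetical holder for crawler settings read from a name=value file. */
class CrawlerConfig {
    final String startUrl;
    final String restrictTo;
    final int threads;
    final String indexDir;

    private CrawlerConfig(String startUrl, String restrictTo, int threads, String indexDir) {
        this.startUrl = startUrl;
        this.restrictTo = restrictTo;
        this.threads = threads;
        this.indexDir = indexDir;
    }

    /** Reads the settings once at startup; fails fast if the start URL is missing. */
    static CrawlerConfig init(Reader source) throws IOException {
        Properties p = new Properties();
        p.load(source);

        String start = p.getProperty("start");
        if (start == null || start.trim().isEmpty()) {
            throw new IllegalArgumentException("missing required property: start");
        }
        // Assumed defaults: crawl everything, 5 threads, index named "luceneIndex".
        String restrict = p.getProperty("restrictto", ".*").trim();
        int threads = Integer.parseInt(p.getProperty("threads", "5").trim());
        String indexDir = p.getProperty("index", "luceneIndex").trim();

        return new CrawlerConfig(start.trim(), restrict, threads, indexDir);
    }
}
```

It would be called as CrawlerConfig.init(new FileReader("larm.properties")),
where the file name is again made up. One thing to watch: java.util.Properties
treats a backslash in a value as an escape character, so a restrictto
expression containing backslashes has to double them in the file.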

> Please try it:
> 
> To build and run it,
> - put ANT in your path
> - provide a build.properties with the location of the lucene Jar file
>   (lucene.jar=) (just like javacc in lucene/build.xml)
> - put HTTPClient.jar from http://innovation.ch/java and the jakarta-oro
>   library into libs
> - type:
> 
> ant run -Dstart=<starturl> -Drestrictto=<restricttourl> -Dthreads=<numThreads>
> 
> ex.:
> ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5
> 
> note: restrictto is a regular expression; the URLs tested against it
> are normalized beforehand, which means they are made lower case,
> index.* are removed, and some other corrections are applied
> (see URLNormalizer.java for details)

Removing index.* may be too bold and incorrect in some situations.
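
For example, index.php?id=3 and index.php?id=4 are different pages, and
lowercasing the path can merge distinct URLs on a case-sensitive
server. A more conservative variant might look like the sketch below.
This is not what URLNormalizer.java does (I have not read it yet); the
ConservativeNormalizer class, the list of default documents and the
full-match semantics for restrictto are all my assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical, deliberately cautious URL normalizer. */
class ConservativeNormalizer {
    // Assumed list of "default documents"; a real server may use others.
    private static final Set<String> DEFAULT_DOCS =
        new HashSet<>(Arrays.asList("index.html", "index.htm"));

    static String normalize(String url) throws URISyntaxException {
        URI u = new URI(url.trim()).normalize(); // also resolves "." and ".." segments
        if (u.getScheme() == null || u.getHost() == null) {
            return url; // not an absolute http-style URL; leave it untouched
        }
        // Scheme and host are case-insensitive, so lowercasing them is safe.
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);

        // The path keeps its case; it may matter on the server.
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = u.getRawQuery();

        // Strip a default document only when there is no query string.
        if (query == null) {
            int slash = path.lastIndexOf('/');
            if (DEFAULT_DOCS.contains(path.substring(slash + 1))) {
                path = path.substring(0, slash + 1);
            }
        }

        // User info and fragment are dropped on purpose.
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();
    }

    /** True if the normalized URL matches the restrictto expression as a whole. */
    static boolean inScope(String restrictTo, String url) throws URISyntaxException {
        return Pattern.matches(restrictTo, normalize(url));
    }
}
```

With this, HTTP://LocalHost/Docs/index.html becomes
http://localhost/Docs/, while http://localhost/index.html?page=2 and
http://localhost/index.php are left alone.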

> note: LuceneStorage is dumb; it just takes the WebDocument and stores
> it. That means with the current config it also stores tags, and only
> one "content" field that contains everything. I plan to write another
> storage that uses the HTMLDocument from the demo package to store HTML
> documents.

Nice.
I found NekoHTML to do a nice job of 'dehtmlization'.

> Please note that when adding this storage to the storage pipeline,
> the whole crawling process becomes CPU-bound instead of I/O-bound.
> We already have plans for how to do the distribution.
> 
> Feel free to contact me if there are questions.
> Still Looking For Contributors!
> 
> Clemens

Ausgezeichnet!

Otis

