Re: Indexing a largish collection of mail and usenet messages?

Christophe Ollier Tue, 02 Jan 2007 02:34:55 -0800

John L wrote:

I have a collection of archives of mailing list and news messages. The largest collection is pretty big, about 150,000 messages which means about 200 megabytes of text, shortly to be migrated to a FreeBSD server. The lists are all active so archives typically add a few messages each day. I want to provide a full text search of each archive. What software should I use? I have been using the sturdy but ancient lqtext package. It's OK, but it has a few bugs I have yet to pick and I'm wondering if something better is available.

You could have a look at Lucene (<http://lucene.apache.org/>), a text search engine library written in Java. I don't know lqtext, but Lucene seems to work in a similar way: one program builds and updates an index, and a second program queries it.
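To illustrate that split between an indexing side and a query side, here is a minimal sketch in plain Python. It is not Lucene's API; the InvertedIndex class, the tokenize helper and the sample messages are all made up for the example, and a real engine adds ranking, on-disk storage and much more.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


class InvertedIndex:
    """Toy index (hypothetical): maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Indexing side: call once per message, including later arrivals."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Query side: return the ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


index = InvertedIndex()
index.add("msg-1", "Full text search for a mailing list archive")
index.add("msg-2", "Usenet archive migration to FreeBSD")
# A message that arrives later is added without rebuilding the index.
index.add("msg-3", "Search the FreeBSD archive")

print(sorted(index.search("freebsd archive")))  # ['msg-2', 'msg-3']
```

The point of the design is that add() only touches the postings of the terms in the new message, so a few new messages a day cost almost nothing compared with a full rebuild.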

It's "only" a library, so you have to write the interfaces yourself, both for you (indexing) and for your users (querying). There are numerous ports to other languages; C, Perl, Python and PHP (through Zend Framework) are in the ports tree.

First, I am NOT, repeat NOT, asking about web spiders. The messages are directly available to indexing software as files on my server, so there's no advantage to running them through Apache on the way to the indexer. Also, the messages in the archive never change and I know what files are new each day, so it would be pointless for a package to re-spider the whole archive to look for the new messages. I am not unalterably opposed to something that spiders if it is otherwise wonderful, but that approach hasn't been fruitful in the past.


Lucene can update an existing index with new documents.

What I want ideally is something that knows enough about the structure of mail messages to deal intelligently with headers vs. body, that can do something reasonable with MIME and HTML bodies (not urgent, I can always run them through demime on the way to the index), and most importantly that actually works with 150,000 messages. I've seen lots of packages that look promising but that fall over dead once they get past 10,000 messages or so.

I don't think Lucene can do this out of the box, but you can associate any keyword with your indexed documents (e.g. mail headers).
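As a rough sketch of what preparing those keywords could look like, the standard Python email package can split a message into header fields and body text before anything is handed to an indexer. The message_fields function, the field names and the sample message (including its address) are made up for the example; how the fields are then stored depends on the search library you pick.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message; the sender and address are invented.
RAW = """\
From: Example Sender <sender@example.com>
Subject: Indexing a largish collection
Date: Tue, 02 Jan 2007 02:34:55 -0800
Content-Type: text/plain; charset="utf-8"

What software should I use for full text search?
"""


def message_fields(raw):
    """Split one raw message into header fields and body text."""
    msg = message_from_string(raw, policy=default)
    # Prefer a text/plain part and fall back to text/html if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(msg["From"] or ""),
        "subject": str(msg["Subject"] or ""),
        "date": str(msg["Date"] or ""),
        "body": body,
    }


fields = message_fields(RAW)
print(fields["subject"])
print(fields["body"].strip())
```

Each key of the returned dictionary can then be indexed as its own field, so a search can be limited to subjects or senders. HTML bodies come back as raw HTML here, so they would still need tag stripping before indexing.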

About performance, I'm personally satisfied. I use the PHP port with 20k documents: the full index takes about an hour to build, and queries take about 100 to 1000 ms. Lucene seems fit for millions of documents.

[...]


--
Christophe
_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
