On 01/10/2005, at 6:30 AM, Erik Hatcher wrote:


On Sep 30, 2005, at 1:26 AM, Paul Smith wrote:


This requirement is almost exactly the same as my requirement for the log4j project I work on, where I wanted to be able to index every row in a text log file as its own Document.

It works fine, but treating each line as a Document turns out to take a while to index (searching is fantastic, though, I have to say) because of the cost of adding each Document to the index. I don't think Lucene is currently tuned (or tunable) to that level of Document granularity, so it will depend on how timely you need the indexing to be.
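
For concreteness, the per-line indexing loop is nothing fancier than something like the sketch below (written against the Lucene 1.4-era API; the class name, index path and field names are all made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class LogFileIndexer {
      public static void main(String[] args) throws Exception {
        // One Document per log line; every addDocument() call pays the
        // per-Document overhead described above.
        IndexWriter writer =
            new IndexWriter("/tmp/logindex", new StandardAnalyzer(), true);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
          Document doc = new Document();
          doc.add(Field.Keyword("lineNumber", String.valueOf(++lineNo)));
          doc.add(Field.Text("line", line));
          writer.addDocument(doc);
        }
        in.close();
        writer.optimize();
        writer.close();
      }
    }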


There are several tunable indexing parameters that can help with batch indexing. By default it is mostly tuned for incremental indexing, but for rapid batch indexing you may need to tune it to merge less often.

Yep, mergeFactor et al. We currently have it at 1000 (with 8 concurrent threads creating Project-based indices, so that could be 8000 open files during search, unless I'm mistaken), and we've also increased the value of maxBufferedDocs as per standard practice.
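
For the record, those knobs look roughly like this (a sketch only, with a made-up class name; it assumes a Lucene version that exposes these setters, and I believe on 1.4 the equivalents are the public mergeFactor and minMergeDocs fields):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class BatchTuning {
      public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/logindex", new StandardAnalyzer(), true);
        // Merge on-disk segments less often: faster batch indexing, but more
        // segments (and therefore more open files) at search time.
        writer.setMergeFactor(1000);
        // Buffer more documents in RAM before a new segment is flushed to disk.
        writer.setMaxBufferedDocs(1000);
        writer.close();
      }
    }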



I was hoping (of course it's a big ask) to be able to index a million rows of relatively short lines of text (as log files tend to be) in a 'few moments', no more than a minute, but even with pretty grunty hardware you run up against the bottleneck of the tokenization process (the StandardAnalyzer is not optimal at all in this case because of the way it 'signals' EOF with an exception).


Signals EOF with an exception? I'm not following that. Where does that occur?


See our recent YourKit "sampling" profile export here:

http://people.apache.org/~psmith/For%20Lucene%20list/IOExceptionProfiling.html

This is a full production test run over 5 hours, indexing 6.5 million records (approx. 30 fields) on dual P4 Xeon servers with 10K SCSI disks. You'll note that a good chunk (35%) of the indexing thread's time is spent in 2 methods of the StandardTokenizerManager. When you look at the source code for those 2 methods, you will see that they rely on FastCharStream's use of an IOException to 'flag' EOF:

    if (charsRead == -1)
      throw new IOException("read past eof");

(line 72-ish)

Of course, we _could_ always write our own analyzer, but it would be really nice if the out-of-the-box one were even better.
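
It wouldn't have to be much, either; something as small as a CharTokenizer subclass sidesteps the JavaCC-generated grammar (and its EOF exception) entirely. A rough sketch, with a made-up class name, and obviously losing StandardAnalyzer's smarter handling of acronyms, hosts, email addresses and so on:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.TokenStream;

    // Bare-bones analyzer for log lines: emits lower-cased alphanumeric runs.
    public class LogLineAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
          protected boolean isTokenChar(char c) {
            return Character.isLetterOrDigit(c);
          }
          protected char normalize(char c) {
            return Character.toLowerCase(c);
          }
        };
      }
    }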



There was someone (apologies, I've forgotten his name; I blame the holiday I just came back from) who could take a relatively small file, such as an XML file, and very quickly index it for on-the-fly XPath-like queries using Lucene, which apparently works very well, but I'm not sure it scales to massive documents such as log files (and your requirements).


Wolfgang Hoschek and the NUX project may be what you're referring to. He contributed the MemoryIndex feature found under contrib/memory. I'm not sure that feature is a good fit for log files or indexing files line-by-line, though.

Yes, Wolfgang's code is very cool, but would only work on small texts.
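
For the archives, using MemoryIndex is roughly this simple (a sketch against the contrib/memory API; the class name, field name and sample text are made up):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;

    public class MemoryIndexDemo {
      public static void main(String[] args) throws Exception {
        // Index a single small text entirely in RAM and run one query over it.
        Analyzer analyzer = new StandardAnalyzer();
        MemoryIndex index = new MemoryIndex();
        index.addField("content",
            "2005-09-30 01:26:03 ERROR connection refused", analyzer);
        float score = index.search(
            new QueryParser("content", analyzer).parse("+error +refused"));
        // A score greater than 0.0f means the query matched this text.
        System.out.println("score = " + score);
      }
    }

It suits the "does this one small document match this query" shape of problem, which is a different shape to indexing millions of log lines.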

cheers,

Paul Smith

