On 01/10/2005, at 6:30 AM, Erik Hatcher wrote:
On Sep 30, 2005, at 1:26 AM, Paul Smith wrote:
This requirement is almost exactly the same as my requirement for
the log4j project I work on, where I wanted to be able to index
every row in a text log file as its own Document.
It works fine, but treating each line as a Document turns out to
take a while to index (searching is fantastic though I have to
say) due to the cost of adding a Document to an index. I don't
think Lucene is currently tuned (or tunable) to that level of
Document granularity, so it'll depend on how timely you need the
indexing to be.
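(In code terms, what I mean is roughly the following; a simplified
sketch with invented field names, using the old-style Field.Text /
Field.Keyword helpers, so adjust for whichever Lucene build you're on.)

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Simplified sketch: one Lucene Document per log line. The per-line
    // addDocument() cost is what makes this slow at millions of lines.
    public class LogLineIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            BufferedReader in = new BufferedReader(new FileReader("app.log"));
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                Document doc = new Document();
                doc.add(Field.Keyword("lineNumber", String.valueOf(++lineNo)));
                doc.add(Field.Text("line", line)); // stored, indexed, tokenized
                writer.addDocument(doc);
            }
            in.close();
            writer.close();
        }
    }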
There are several tunable indexing parameters that can help with
batch indexing. By default it is mostly tuned for incremental
indexing, but for rapid batch indexing you may need to tune it to
merge less often.
Yep, mergeFactor et al. We currently have it at 1000 (with 8
concurrent threads creating Project-based indices, so that could be
8,000 open files during search, unless I'm mistaken), plus we've
increased the value of maxBufferedDocs as per standard practice.
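To make that concrete, the writer setup looks roughly like this (the
numbers are just what we happen to use, not recommendations, and it
assumes a build that has the setter methods; on older releases these
are the public mergeFactor/minMergeDocs fields on IndexWriter):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Sketch of the batch-indexing knobs we touch for rapid bulk loads.
    public class BatchTuning {
        public static IndexWriter openBatchWriter(String path) throws IOException {
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
            writer.setMergeFactor(1000);      // merge segments far less often
            writer.setMaxBufferedDocs(10000); // buffer more docs in RAM per flush
            return writer;
        }
    }

The trade-off is exactly the open-files issue mentioned above: a high
mergeFactor leaves many more segments (and file handles) around until
an optimize()/merge happens.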
I was hoping (of course it's a big ask) to be able to index a
million rows of relatively short lines of text (as log files tend
to be) in a 'few moments', no more than 1 minute, but even with
pretty grunty hardware you run up against the bottleneck of the
tokenization process (the StandardAnalyzer is not optimal at all
in this case because of the way it 'signals' EOF with an exception).
Signals EOF with an exception? I'm not following that. Where does
that occur?
See our recent YourKit "sampling" profile export here:
http://people.apache.org/~psmith/For%20Lucene%20list/IOExceptionProfiling.html
This is a full production test run over 5 hours indexing 6.5 million
records (approx 30 fields) running on Dual P4 Xeon servers with 10K
SCSI disks. You'll note that a good chunk (35%) of the time of the
indexing thread is spent in 2 methods of the
StandardTokenizerManager. When you look at the source code for these
two methods, you will see that they rely on FastCharStream using an
IOException to 'flag' EOF:
    if (charsRead == -1)
      throw new IOException("read past eof");
(line 72-ish)
Of course, we _could_ always write our own analyzer, but it would be
really nice if the out-of-the-box one were even better.
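If we did go down that path, even something as dumb as the sketch
below would avoid the generated tokenizer (and its EOF-by-exception)
entirely; this is purely illustrative, using the stock whitespace
tokenizer and lower-casing, not something we actually run:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // "Dumb" analyzer for log lines: split on whitespace and lowercase.
    // No JavaCC-generated state machine, so no IOException to signal EOF.
    public class SimpleLogLineAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }
    }

You lose StandardAnalyzer's smarter handling of acronyms, hosts,
numbers and so on, which may or may not matter for log text.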
There was someone (apologies, I've forgotten his name, I blame the
holiday I just came back from) who could take a relatively small
file, such as an XML file, and very quickly index it for on-the-fly
XPath-like queries using Lucene, which apparently works very well,
but I'm not sure it scales to massive documents such as log files
(and your requirements).
Wolfgang Hoschek and the NUX project may be what you're referring
to. He contributed the MemoryIndex feature found under contrib/
memory. I'm not sure that feature is a good fit for the log file
case or for indexing files line-by-line, though.
Yes, Wolfgang's code is very cool, but would only work on small texts.
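For anyone else following along, my understanding of how MemoryIndex
gets used is roughly the following (just a sketch from reading
contrib/memory, I haven't benchmarked it):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.TermQuery;

    // One small text is indexed entirely in RAM and queried straight away.
    // Each MemoryIndex holds just that single "document", which is why it
    // doesn't obviously help with millions of log lines.
    public class MemoryIndexSketch {
        public static void main(String[] args) {
            MemoryIndex index = new MemoryIndex();
            index.addField("content",
                    "2005-09-30 01:26:03 ERROR Connection refused",
                    new StandardAnalyzer());
            // a score greater than 0.0f means the query matched the text
            float score = index.search(new TermQuery(new Term("content", "error")));
            System.out.println("score = " + score);
        }
    }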
cheers,
Paul Smith