Hi list,

I'm having trouble achieving good performance when indexing the Wikipedia XML dump. The indexing process works as follows:

1. set up an FSDirectory
2. set up an IndexWriter
3. set up a custom analyzer chaining WikipediaTokenizer, LowerCaseFilter, PorterStemFilter, StopFilter and LengthFilter
4. create an XMLStreamReader that reads from the XML file
5. run the parser and pull the <text> tag contents as well as the <title> contents into a Document
6. add the document to the index

The options for the writer are:
- compound file is turned off
- merge factor set to 150
- RAM buffer size set to 300 MB

In addition, the XML stream is read through a BufferedReader with a 100 MB buffer. I've pasted a stripped-down sketch of the code at the end of this message.

This all works well for the first couple of minutes, indexing the extracted articles very quickly, but later on problems start to show. The symptoms are:
- the CPU is at 100% and the stream reading and indexing seem to have stopped
- the application appears to be dead
- it resumes after some time (anywhere between 1 and 40 minutes)

I've double-checked my code for any problems and even rewritten it a couple of times, so this makes me think there may be some problem in Lucene itself. The problem persists in both 2.4.1 and 2.9-dev. Is there any known bug related to long-running batch indexing processes that operate on large documents? In my case the single XML file is 20 GB, and I'm surprised how quickly the performance of the indexer degrades.

Do you have any suggestions?

best,
Mateusz Berezecki
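
P.S. In case it helps, here is roughly what the setup looks like. It's a sketch rather than my exact code: the paths, class names and the LengthFilter bounds are placeholders, and the Lucene 2.4 calls are written from memory.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

// custom analyzer: WikipediaTokenizer -> LowerCaseFilter -> PorterStemFilter
//                  -> StopFilter -> LengthFilter
class WikiAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WikipediaTokenizer(reader);
    ts = new LowerCaseFilter(ts);
    ts = new PorterStemFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new LengthFilter(ts, 2, 64);   // min/max token length (placeholder values)
    return ts;
  }
}

public class WikiIndexer {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.getDirectory("/data/index");
    IndexWriter writer = new IndexWriter(dir, new WikiAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setUseCompoundFile(false);   // compound file turned off
    writer.setMergeFactor(150);         // merge factor 150
    writer.setRAMBufferSizeMB(300);     // 300 MB RAM buffer

    // 100 MB read buffer over the raw file reader
    BufferedReader in = new BufferedReader(
        new FileReader("/data/enwiki-pages-articles.xml"), 100 * 1024 * 1024);
    XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);

    String title = null;
    while (xml.hasNext()) {
      if (xml.next() != XMLStreamConstants.START_ELEMENT) continue;
      if ("title".equals(xml.getLocalName())) {
        title = xml.getElementText();
      } else if ("text".equals(xml.getLocalName())) {
        // one Lucene Document per article: stored title + analyzed body text
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", xml.getElementText(), Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
      }
    }
    writer.close();
  }
}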
1. setup FSDirectory 2. setup IndexWriter 3. setup custom analyzer chaining wikipediatokenizer, lowercasefilter, porterstemmer, stopfilter and lengthfilter 3. create XMLStreamReader that reads from XML file 4. run the parser and get <text> tag contents as well as <title> contents and insert them into Document 5. add document to the index the options for the writer are - compound file is turned off - merge factor set to 150 - ram buffer size is set to 300 MB in addition to that the XML stream is read using bufferedfilereader with buffer size of 100 MB This all works good for the first couple of minutes indexing extracted articles very quickly but later on some problems start to show. The symptoms are: - the CPU is at 100% and the stream reading and indexing seems to be stopped - the application seems to be dead - it resumes after some time (anywhere between 1 to 40 minutes) I've double checked my code for any problems and even rewritten it a couple of times so this makes me think that there's some problem in lucene itself. The problem is persistent in both 2.4.1 and 2.9-dev versions. Is there any known bug related to long running batch indexing processes that operate on large documents? In my case the single XML file is 20 GB and I'm just surprised how quickly the performance of the indexer degrades. Do you have any suggestions? best, Mateusz Berezecki --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org