Hi list,

I'm having trouble achieving good performance when indexing the Wikipedia XML dump. The indexing process works as follows:

1. set up an FSDirectory
2. set up an IndexWriter
3. set up a custom analyzer chaining WikipediaTokenizer, LowerCaseFilter, PorterStemFilter, StopFilter and LengthFilter
4. create an XMLStreamReader that reads from the XML file
5. run the parser and pull the <text> tag contents as well as the <title> contents into a Document
6. add the document to the index

The options for the writer are:
- compound file is turned off
- merge factor set to 150
- RAM buffer size set to 300 MB

In addition, the XML stream is read through a BufferedReader with a 100 MB buffer. I've pasted a stripped-down sketch of the code at the end of this message.

This all works well for the first couple of minutes, indexing the extracted articles very quickly, but later on problems start to show. The symptoms are:
- the CPU is at 100% and the stream reading and indexing seem to have stopped
- the application appears to be dead
- it resumes after some time (anywhere between 1 and 40 minutes)

I've double-checked my code for any problems and even rewritten it a couple of times, so this makes me think there may be some problem in Lucene itself. The problem persists in both 2.4.1 and 2.9-dev. Is there any known bug related to long-running batch indexing processes that operate on large documents? In my case the single XML file is 20 GB, and I'm surprised how quickly the performance of the indexer degrades.

Do you have any suggestions?

best,
Mateusz Berezecki
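
P.S. In case it helps, here is roughly what the setup looks like. It's a sketch rather than my exact code: the paths, class names and the LengthFilter bounds are placeholders, and the Lucene 2.4 calls are written from memory.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

// custom analyzer: WikipediaTokenizer -> LowerCaseFilter -> PorterStemFilter
//                  -> StopFilter -> LengthFilter
class WikiAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WikipediaTokenizer(reader);
    ts = new LowerCaseFilter(ts);
    ts = new PorterStemFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new LengthFilter(ts, 2, 64);   // min/max token length (placeholder values)
    return ts;
  }
}

public class WikiIndexer {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.getDirectory("/data/index");
    IndexWriter writer = new IndexWriter(dir, new WikiAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setUseCompoundFile(false);   // compound file turned off
    writer.setMergeFactor(150);         // merge factor 150
    writer.setRAMBufferSizeMB(300);     // 300 MB RAM buffer

    // 100 MB read buffer over the raw file reader
    BufferedReader in = new BufferedReader(
        new FileReader("/data/enwiki-pages-articles.xml"), 100 * 1024 * 1024);
    XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);

    String title = null;
    while (xml.hasNext()) {
      if (xml.next() != XMLStreamConstants.START_ELEMENT) continue;
      if ("title".equals(xml.getLocalName())) {
        title = xml.getElementText();
      } else if ("text".equals(xml.getLocalName())) {
        // one Lucene Document per article: stored title + analyzed body text
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", xml.getElementText(), Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
      }
    }
    writer.close();
  }
}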
1. setup FSDirectory 2. setup IndexWriter 3. setup custom analyzer chaining wikipediatokenizer, lowercasefilter, porterstemmer, stopfilter and lengthfilter 3. create XMLStreamReader that reads from XML file 4. run the parser and get <text> tag contents as well as <title> contents and insert them into Document 5. add document to the index the options for the writer are - compound file is turned off - merge factor set to 150 - ram buffer size is set to 300 MB in addition to that the XML stream is read using bufferedfilereader with buffer size of 100 MB This all works good for the first couple of minutes indexing extracted articles very quickly but later on some problems start to show. The symptoms are: - the CPU is at 100% and the stream reading and indexing seems to be stopped - the application seems to be dead - it resumes after some time (anywhere between 1 to 40 minutes) I've double checked my code for any problems and even rewritten it a couple of times so this makes me think that there's some problem in lucene itself. The problem is persistent in both 2.4.1 and 2.9-dev versions. Is there any known bug related to long running batch indexing processes that operate on large documents? In my case the single XML file is 20 GB and I'm just surprised how quickly the performance of the indexer degrades. Do you have any suggestions? best, Mateusz Berezecki --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org