I was wondering if there are tricks for making indexing faster in Lucene. I have a program which reads XML "documents" from a file, and indexes the 7 or so fields which occur in them. Most of the fields are very short, and the one long one averages a few hundred words.
To index 20000 such records takes 615 seconds. I use an IndexWriter with a String as the first argument, i.e. indexing directly to disc. If I change the mergeFactor to 100, the time drops to 275 seconds. At 1000, it drops to 249s. These times are not bad in absolute terms, but the 20000 records represents only about 2% of my data, so indexing the whole lot takes many hours.

Using java -Xprof and mergeFactor=10, the biggest consumers of processing time are:

  22.2%     5 + 13172   java.io.RandomAccessFile.open
  16.1%     4 +  9567   java.io.RandomAccessFile.close
  13.3%     4 +  7880   java.io.RandomAccessFile.readBytes
   8.1%     5 +  4818   java.io.RandomAccessFile.writeBytes
   7.2%  4293 +     9   org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
   5.8%     5 +  3426   java.io.Win32FileSystem.delete

I believe all of these are calls from Lucene as I don't use any of the above methods in my own code. readBytes and writeBytes I can believe, but why so much time on open and close?

Incidentally with mergeFactor=1000, the biggest consumers are

  29.7%     0 +  6729   java.io.RandomAccessFile.readBytes
  19.0%  4296 +    12   org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0

As a point of comparison, I tried AltaVista's Java SDK (Nov 2000 release). I have a generic indexer program which differs only in the specific indexing calls for AV and Lucene. For the same 20000 records, it took only 57 seconds. This, I feel, does not speak well to Doug's comment in the Lucene FAQ that indexing in Lucene is very fast.

If anyone has ideas for making it faster, I'd be interested to hear them.

-- 
David Elworthy
