My experience is that writing to the index takes the most time, aside from any parsing done by the user. I have been working on XML indexes, and there the collection of data takes just as much time as the writing. To increase speed I did three things that reduced my indexing time from 11 hours to 2.5 hours for the same dataset (1.3 GB of XML documents):
1: I index 50 documents into a RAMDir; when that limit is reached I merge the RAMDir into an FSDir and flush the RAMDir. This speeds things up because I don't have to touch the FSDir as often, and the RAMDir is much faster.

2: Merging a large index into a large index takes nearly as much time as merging a small index into a large index, so I keep 4 FSDirs (any number will do) that I write RAMDirs to, and then merge these FSDirs into one large FSDir at the end of a large index run.

3: I multithreaded my application: I create worker threads that each index into their own separate RAMDir and then flush these RAMDirs into their own separate FSDirs (hence I have one FSDir per worker thread), because a directory can only be written to by one thread at a time.

In the end this improved my indexing time a lot... hope some of this can help you!

mvh
Karl Øie

On Monday 25 March 2002 14:08, you wrote:
> I was wondering if there are tricks for making indexing faster in
> Lucene. I have a program which reads XML "documents" from a file, and
> indexes the 7 or so fields which occur in them. Most of the fields are
> very short, and the one long one averages a few hundred words.
>
> To index 20000 such records takes 615 seconds. I use an IndexWriter with
> a String as the first argument, i.e. indexing directly to disc. If I
> change the mergeFactor to 100, the time drops to 275 seconds. At 1000,
> it drops to 249s. These times are not bad in absolute terms, but the
> 20000 records represents only about 2% of my data, so indexing the whole
> lot takes many hours.
> Using java -Xprof and mergeFactor=10, the biggest consumers of
> processing time are:
>   22.2%     5 + 13172   java.io.RandomAccessFile.open
>   16.1%     4 +  9567   java.io.RandomAccessFile.close
>   13.3%     4 +  7880   java.io.RandomAccessFile.readBytes
>    8.1%     5 +  4818   java.io.RandomAccessFile.writeBytes
>    7.2%  4293 +     9   org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
>    5.8%     5 +  3426   java.io.Win32FileSystem.delete
>
> I believe all of these are calls from Lucene as I don't use any of the
> above methods in my own code. readBytes and writeBytes I can believe,
> but why so much time on open and close? Incidentally with
> mergeFactor=1000, the biggest consumers are
>   29.7%     0 +  6729   java.io.RandomAccessFile.readBytes
>   19.0%  4296 +    12   org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
>
> As a point of comparison, I tried AltaVista's Java SDK (Nov 2000
> release). I have a generic indexer program which differs only in the
> specific indexing calls for AV and Lucene. For the same 20000 records,
> it took only 57 seconds. This, I feel, does not speak well to Doug's
> comment in the Lucene FAQ that indexing in Lucene is very fast. If
> anyone has ideas for making it faster, I'd be interested to hear them.
>
> -- David Elworthy
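PS: tip 1 above (buffer into a RAM directory, merge to disk only when a threshold is hit) can be sketched in plain Java. This is only an illustration of the batching pattern with Lists standing in for a RAMDirectory/FSDirectory pair; BatchingIndexer, flushThreshold etc. are invented names for the sketch, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Batching pattern: cheap in-memory writes, one expensive "disk merge"
// per batch of documents instead of one per document.
class BatchingIndexer {
    private final int flushThreshold;                       // e.g. 50 documents
    private final List<String> ramDir = new ArrayList<>();  // stands in for RAMDirectory
    private final List<String> fsDir = new ArrayList<>();   // stands in for FSDirectory

    BatchingIndexer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void addDocument(String doc) {
        ramDir.add(doc);                  // fast in-memory write
        if (ramDir.size() >= flushThreshold) {
            flush();                      // expensive merge, amortized over the batch
        }
    }

    void flush() {
        fsDir.addAll(ramDir);             // merge RAM buffer into the disk index
        ramDir.clear();                   // empty the RAM buffer for reuse
    }

    int indexedCount() {
        return fsDir.size();
    }
}
```

In real Lucene terms the merge step would be the point where the RAM-held segment is written through to the on-disk index, so the costly file open/close traffic in David's profile happens once per 50 documents rather than continuously.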
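Tips 2 and 3 combine into one pattern: each worker thread writes only to its own private directory (so no writer ever contends with another), and the small per-thread directories are merged into one big one in a single final step. Again a minimal plain-Java sketch with Lists standing in for directories and all names invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each worker thread indexes into its own "directory"; a single final
// merge combines them. This mirrors the rule that only one thread may
// write to a given Lucene directory at a time.
class ParallelIndexer {
    static List<String> indexAll(List<String> docs, int nThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<List<String>>> partials = new ArrayList<>();
        int chunk = (docs.size() + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int from = Math.min(docs.size(), t * chunk);
            final int to = Math.min(docs.size(), from + chunk);
            partials.add(pool.submit(() -> {
                List<String> ownDir = new ArrayList<>(); // per-thread dir stand-in
                for (int i = from; i < to; i++) {
                    ownDir.add(docs.get(i));             // only this thread writes here
                }
                return ownDir;
            }));
        }
        pool.shutdown();
        // Final step: merge the small per-thread directories into one big one.
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : partials) {
            merged.addAll(f.get());
        }
        return merged;
    }
}
```

The point of the design is that all cross-thread coordination is deferred to the one merge at the end, instead of serializing every write through a single shared directory.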
