My experience is that writing to the index takes the most time, aside from any parsing done by the user. I have been working on XML indexes, and there collecting the data takes just as long as writing it. To increase speed I did three things that cut my indexing time from 11 hours to 2.5 hours for the same dataset (1.3 GB of XML documents).

1: I index 50 documents into a ramdir; when that limit is reached I merge the ramdir into an fsdir and flush the ramdir. This speeds things up because I don't have to touch the fsdir as often, and the ramdir is much faster.

2: Merging a large index into a large index takes nearly as much time as merging a small index into a large index, so I keep 4 fsdirs (any number will do) that I write the ramdirs to, and only merge those fsdirs into one large fsdir at the end of a large index run.
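
That final merge is a single addIndexes call, something like this (again a sketch; the part paths and the count of 4 are just my choices):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FinalMerge {
    public static void main(String[] args) throws Exception {
        // open the partial indexes written during the run
        Directory[] parts = new Directory[4];
        for (int i = 0; i < parts.length; i++)
            parts[i] = FSDirectory.getDirectory("/index/part" + i, false);

        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/index/final", true),
            new StandardAnalyzer(), true);
        writer.addIndexes(parts); // one big merge at the very end
        writer.close();
    }
}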

3: I multithreaded my application, creating worker threads that each index into their own separate ramdir and then flush those ramdirs into their own separate fsdirs (hence one fsdir per worker thread). This is because only one thread at a time can write to a directory.
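
A worker looks roughly like this (sketch only; it reuses the BatchedIndexer sketch above, and how you split the input into per-thread iterators is up to you):

import java.io.IOException;
import java.util.Iterator;

public class IndexWorker extends Thread {
    private final Iterator docs;   // this worker's share of the documents
    private final String partPath; // this worker's private fsdir

    public IndexWorker(Iterator docs, String partPath) {
        this.docs = docs;
        this.partPath = partPath;
    }

    public void run() {
        try {
            // writes only to its own ramdir/fsdir, so no two threads
            // ever touch the same Directory
            BatchedIndexer.index(docs, partPath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

// spawning, e.g. (sharesOfInput is whatever partitioning of your input you use):
//   for (int i = 0; i < 4; i++)
//       new IndexWorker(sharesOfInput[i], "/index/part" + i).start();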

In the end this improved my indexing time a lot...

Hope some of this helps you!

Best regards,
Karl Øie


On Monday 25 March 2002 14:08, you wrote:
> I was wondering if there are tricks for making indexing faster in
> Lucene. I have a program which reads XML "documents" from a file, and
> indexes the 7 or so fields which occur in them. Most of the fields are
> very short, and the one long one averages a few hundred words.
>
> To index 20000 such records takes 615 seconds. I use an IndexWriter with
> a String as the first argument, i.e. indexing directly to disc. If I
> change the mergeFactor to 100, the time drops to 275 seconds. At 1000,
> it drops to 249s. These times are not bad in absolute terms, but the
> 20000 records represents only about 2% of my data, so indexing the whole
> lot takes many hours. Using java -Xprof and mergeFactor=10, the biggest
> consumers of processing time are:
>  22.2%     5  + 13172    java.io.RandomAccessFile.open
>  16.1%     4  +  9567    java.io.RandomAccessFile.close
>  13.3%     4  +  7880    java.io.RandomAccessFile.readBytes
>   8.1%     5  +  4818    java.io.RandomAccessFile.writeBytes
>   7.2%  4293  +     9  org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
>   5.8%     5  +  3426    java.io.Win32FileSystem.delete
>
> I believe all of these are calls from Lucene as I don't use any of the
> above methods in my own code. readBytes and writeBytes I can believe,
> but why so much time on open and close? Incidentally with
> mergeFactor=1000, the biggest consumers are
>  29.7%     0  +  6729    java.io.RandomAccessFile.readBytes
>  19.0%  4296  +    12  org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
>
>
> As a point of comparison, I tried AltaVista's Java SDK (Nov 2000
> release). I have a generic indexer program which differs only in the
> specific indexing calls for AV and Lucene. For the same 20000 records,
> it took only 57 seconds. This, I feel, does not speak well to Doug's
> comment in the Lucene FAQ that indexing in Lucene is very fast. If
> anyone has ideas for making it faster, I'd be interested to hear them.
>
> -- David Elworthy
