See my note about overlapping the indexing of documents with merging:
http://www.gossamer-threads.com/lists/lucene/java-user/34188?search_string=%2Bkeegan%20%2Baddindexes;#34188

Peter

On 6/12/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:
Nadav Har'El wrote:

> Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 12/06/2006 04:36:45 PM:
>
>> Nadav,
>>
>> Look up one of my onjava.com Lucene articles, where I talk about
>> this. You may also want to tell Lucene to merge segments on disk
>> less frequently, which is what mergeFactor does.
>
> Thanks. Can you please point me to the appropriate article? (I found one
> from March 2003, but I'm not sure if it's the one you meant.)
>
> About mergeFactor() - thanks for the hint, I'll try changing it too (I
> used 20 so far) and see if it helps performance.
>
> Still, there is one thing about mergeFactor(), and the merge process,
> that I don't understand: does having more memory help this process at
> all? Does having a large mergeFactor() actually require more memory?
> The reason I'm asking is that I'm still trying to figure out whether
> having a machine with huge RAM actually helps Lucene or not.

I'm using 1.4.3, so I don't know if things are the same in 2.0. Anyhow, I found a significant performance benefit from changing minMergeDocs and mergeFactor from their defaults of 10 and 10 to 1,000 and 70, respectively.

The improvement seems to come from a reduction in the number of merges as the index is created. Each merge involves reading and writing a bunch of data that has already been indexed, sometimes everything indexed so far, so it's easy to see how reducing the number of merges reduces the overall indexing time. I can't remember why, but I also saw little benefit from increasing minMergeDocs beyond 1,000.

A lot of time was being spent in the first merge, which takes a bunch of one-document "segments" in a RAMDirectory and merges them into the first-level segments on disk. I hacked the code to make this first merge (and ONLY the first merge) operate on minMergeDocs * mergeFactor documents instead, which greatly increased the RAM consumption but reduced the indexing time.

In detail, what I started with was:

  a. read minMergeDocs docs, creating one-doc segments in RAM
  b. read those one-doc RAM segments and merge them
  c. write the merged results to a disk segment
  ...
  i. read mergeFactor first-level disk segments and merge them
  j. write second-level segments to disk
  ...
  p. normal disk-based merging thereafter, as necessary

And what I ended up with was:

  A. read minMergeDocs * mergeFactor docs, and remember them in RAM
  B. write a segment from all the remembered RAM docs (a modified merge)
  ...
  F. normal disk-based merging thereafter, as necessary

In essence, I eliminated that first-level merge, the one that involved lots and lots of teeny-weeny I/O operations that were very inefficient. In my case, steps A & B worked on 70,000 documents instead of 1,000. Remembering all those docs required a lot of RAM (almost 2 GB), but it almost tripled indexing performance. Later, I had to knock the 70 down to 35 (maybe because my docs got a lot bigger, but I don't remember now), but you get the idea.

I couldn't use a mergeFactor of 70,000 because that would take far more file descriptors than I could have without recompiling the kernel (I seem to remember my limit being 1,024, and each segment took 14 file descriptors, so merging 70,000 segments at once would have needed nearly a million).

Hope it helps.

--MDC
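For reference, the tuning described above looks roughly like this against the Lucene 1.4.x API, where mergeFactor and minMergeDocs are public fields on IndexWriter (later releases replaced them with setters). This is only a minimal sketch; the class name, index path, and analyzer are placeholders, not anything from the original post.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class TunedIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical index path and analyzer -- substitute your own.
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

            // Defaults are 10 and 10; these values match the tuning in the post.
            writer.minMergeDocs = 1000; // buffer 1,000 docs in RAM before writing a disk segment
            writer.mergeFactor = 70;    // merge 70 segments at a time, so merges happen less often

            // ... writer.addDocument(doc) calls go here ...

            writer.close();
        }
    }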
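The first-merge hack itself required modifying IndexWriter internals, but a broadly similar effect can be approximated with the public 1.4.x API by building each batch of documents in a RAMDirectory-backed writer and then folding it into the on-disk index with addIndexes(). The sketch below is a rough approximation under that assumption, not the author's actual patch; the class name, field name, path, and batch contents are illustrative, and note that addIndexes() also optimizes the target index, so it is not a drop-in equivalent.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBatchIndexer {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // Hypothetical on-disk index location.
            IndexWriter diskWriter =
                new IndexWriter(FSDirectory.getDirectory("/path/to/index", true), analyzer, true);

            int batchSize = 70000; // minMergeDocs * mergeFactor from the post's example

            // Build one large batch entirely in memory.
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
            ramWriter.minMergeDocs = batchSize; // buffer the whole batch; close() flushes it as one segment
            for (int i = 0; i < batchSize; i++) {
                Document doc = new Document();
                doc.add(Field.Text("contents", "placeholder document text " + i));
                ramWriter.addDocument(doc);
            }
            ramWriter.close();

            // Merge the in-memory batch into the disk index in one large write
            // (addIndexes() optimizes the target index as part of the call).
            diskWriter.addIndexes(new Directory[] { ramDir });
            diskWriter.close();
        }
    }

Whether this wins anything depends on document size and available heap; as noted above, a 70,000-document batch cost almost 2 GB of RAM in the original experiment.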