Nadav Har'El wrote:

Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 12/06/2006 04:36:45 PM:


Nadav,

Look up one of my onjava.com Lucene articles, where I talk about
this.  You may also want to tell Lucene to merge segments on disk
less frequently, which is what mergeFactor does.


Thanks. Can you please point me to the appropriate article (I found one
from March 2003, but I'm not sure if it's the one you meant).

About mergeFactor() - thanks for the hint; I'll try changing it too (I used
20 so far) and see if it helps performance.

Still, there is one thing about mergeFactor(), and the merge process, that
I don't understand: does having more memory help this process at all? Does
having a large mergeFactor() actually require more memory? The reason I'm
asking this is that I'm still trying to figure out whether having a machine
with huge RAM actually helps Lucene, or not.

I'm using 1.4.3, so I don't know if things are the same in 2.0. Anyhow, I found a significant performance benefit from changing minMergeDocs and mergeFactor from their defaults of 10 and 10 to 1,000 and 70, respectively. The improvement seems to come from a reduction in the number of merges as the index is created. Each merge involves reading and writing a bunch of data already indexed, sometimes everything indexed so far, so it's easy to see how reducing the number of merges reduces the overall indexing time.
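For reference, here's a minimal sketch of how I set those two knobs. If I remember the 1.4.x API right, they're still plain public fields on IndexWriter (newer versions use setters instead); the index path and values below are just placeholders:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class TuneMergeSettings {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

      // Buffer this many docs before the first merge writes a disk segment.
      writer.minMergeDocs = 1000;
      // Let this many segments accumulate at each level before merging them.
      writer.mergeFactor = 70;

      // ... addDocument() calls go here ...

      writer.optimize();
      writer.close();
    }
  }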

I can't remember why, but I also saw little benefit from increasing minMergeDocs beyond 1,000. A lot of time was being spent in the first merge, which takes a bunch of one-document "segments" in a RAMDirectory and merges them into the first-level segments on disk. I hacked the code to make this first merge (and ONLY the first merge) operate on minMergeDocs * mergeFactor documents instead, which greatly increased the RAM consumption but reduced the indexing time. In detail, what I started with was:
  a.  read minMergeDocs of docs, creating one-doc segments in RAM
  b.  read those one-doc RAM segments and merge them
  c.  write the merged results to a disk segment
  ...
  i.  read mergeFactor first-level disk segments and merge them
  j.  write second-level segments to disk
  ...
  p.  normal disk-based merging thereafter, as necessary

And what I ended up with was:
  A.  read minMergeDocs * mergeFactor docs, and remember them in RAM
  B.  write a segment from all the remembered RAM docs (a modified merge)
  ...
  F.  normal disk-based merging thereafter, as necessary

In essence, I eliminated that first-level merge, the one that involved lots and lots of teeny-weeny, very inefficient I/O operations.
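For what it's worth, if you end up on 1.9 or 2.0, I believe you can get much the same effect without patching anything: setMaxBufferedDocs() looks like it plays exactly this role, i.e. how many docs get buffered in RAM before the first disk segment is written. A rough sketch (again, path and numbers are just placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class BigFirstSegment {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

      // Hold roughly minMergeDocs * mergeFactor docs in RAM and flush them
      // as one big first-level segment instead of many small ones.
      writer.setMaxBufferedDocs(70000);
      writer.setMergeFactor(70);   // normal disk-based merging thereafter

      // ... addDocument() calls go here ...

      writer.optimize();
      writer.close();
    }
  }

As far as I can tell the buffered docs still sit in a RAMDirectory until the flush, so the RAM cost should be in the same ballpark as my hacked version.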

In my case, steps A & B worked on 70,000 documents instead of 1,000. Remembering all those docs required a lot of RAM (almost 2GB), but it almost tripled indexing performance. Later, I had to knock the 70 down to 35 (maybe because my docs got a lot bigger but I don't remember now), but you get the idea. I couldn't use a mergeFactor of 70,000 because that's way more file descriptors than I could have without recompiling the kernel (I seem to remember my limit being 1,024, and each segment took 14 file descriptors).

Hope it helps.

--MDC
