Nadav Har'El wrote:

Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 12/06/2006 04:36:45 PM:


Nadav,

Look up one of my onjava.com Lucene articles, where I talk about
this.  You may also want to tell Lucene to merge segments on disk
less frequently, which is what mergeFactor does.


Thanks. Can you please point me to the appropriate article (I found one
from March 2003, but I'm not sure if it's the one you meant).

About mergeFactor() - thanks for the hint; I'll try changing it too (I used
20 so far) and see if it helps performance.

Still, there is one thing about mergeFactor(), and the merge process, that
I don't understand: does having more memory help this process at all? Does
having a large mergeFactor() actually require more memory? The reason I'm
asking this is that I'm still trying to figure out whether having a machine
with huge RAM actually helps Lucene, or not.

I'm using 1.4.3, so I don't know if things are the same in 2.0. Anyhow, I found a significant performance benefit from changing minMergeDocs and mergeFactor from their defaults of 10 and 10 to 1,000 and 70, respectively. The improvement seems to come from a reduction in the number of merges as the index is created. Each merge involves reading and writing a bunch of data already indexed, sometimes everything indexed so far, so it's easy to see how reducing the number of merges reduces the overall indexing time.
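For reference, here's a minimal sketch of how I set those two knobs. If I remember the 1.4.x API right, they're still plain public fields on IndexWriter (newer versions use setters instead); the index path and values below are just placeholders:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class TuneMergeSettings {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

      // Buffer this many docs before the first merge writes a disk segment.
      writer.minMergeDocs = 1000;
      // Let this many segments accumulate at each level before merging them.
      writer.mergeFactor = 70;

      // ... addDocument() calls go here ...

      writer.optimize();
      writer.close();
    }
  }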

I can't remember why, but I also saw little benefit from increasing minMergeDocs beyond 1,000. A lot of time was being spent in the first merge, which takes a bunch of one-document "segments" in a RAMDirectory and merges them into the first-level segments on disk. I hacked the code to make this first merge (and ONLY the first merge) operate on minMergeDocs * mergeFactor documents instead, which greatly increased the RAM consumption but reduced the indexing time. In detail, what I started with was:
  a.  read minMergeDocs of docs, creating one-doc segments in RAM
  b.  read those one-doc RAM segments and merge them
  c.  write the merged results to a disk segment
  ...
  i.  read mergeFactor first-level disk segments and merge them
  j.  write second-level segments to disk
  ...
  p.  normal disk-based merging thereafter, as necessary

And what I ended up with was:
  A.  read minMergeDocs * mergeFactor docs, and remember them in RAM
  B.  write a segment from all the remembered RAM docs (a modified merge)
  ...
  F.  normal disk-based merging thereafter, as necessary

In essence, I eliminated that first-level merge, the one that involved lots and lots of teeny-weeny, very inefficient I/O operations.
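For what it's worth, if you end up on 1.9 or 2.0, I believe you can get much the same effect without patching anything: setMaxBufferedDocs() looks like it plays exactly this role, i.e. how many docs get buffered in RAM before the first disk segment is written. A rough sketch (again, path and numbers are just placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class BigFirstSegment {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

      // Hold roughly minMergeDocs * mergeFactor docs in RAM and flush them
      // as one big first-level segment instead of many small ones.
      writer.setMaxBufferedDocs(70000);
      writer.setMergeFactor(70);   // normal disk-based merging thereafter

      // ... addDocument() calls go here ...

      writer.optimize();
      writer.close();
    }
  }

As far as I can tell the buffered docs still sit in a RAMDirectory until the flush, so the RAM cost should be in the same ballpark as my hacked version.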

In my case, steps A & B worked on 70,000 documents instead of 1,000. Remembering all those docs required a lot of RAM (almost 2GB), but it almost tripled indexing performance. Later, I had to knock the 70 down to 35 (maybe because my docs got a lot bigger but I don't remember now), but you get the idea. I couldn't use a mergeFactor of 70,000 because that's way more file descriptors than I could have without recompiling the kernel (I seem to remember my limit being 1,024, and each segment took 14 file descriptors).

Hope it helps.

--MDC
