Re: Question about Lucene 2.3. file formats?

Michael McCandless Tue, 22 Jan 2008 12:08:09 -0800

Ivan Vasilev wrote:

Hi Lucene Guys,
As I see in the Lucene web site in file formats page the version 2.3 will have some changes in file formats that are very important for us. First I will say what we do and then will ask my questions.
We distribute the index on some machines. The implementation is made so that we copy some segments to one machine and for them we create the segments_N metadata file according to the rules described in Lucene web site. Which exactly segments we will move to other machine we calculate based on the available disk spaces and the size in bytes of the segments. Now as I see you will use data sharing so that some segments will store documents of some other segments. This rise some questions in us regarding how to support our clusterization for Lucene 2.3.

Are you referring to sharing of the docStore files (term vectors & stored fields), when autoCommit=false? Assuming so....

  1. Is this sharing temporary or it is constant? I mean is sharing
     will take place only in the process of adding documents to index
     and after that, may be when optimization or some other process is
     run the shared documents are distributed among the segments that
     use them? Or it is possible that shared documents on a segment
     will remain shared after optimizing?

Only documents added in a single IndexWriter session are shared. If you run optimize, assuming the index was not already optimized, the sharing is removed since there's only one segment at the end.


Also, if you open with autoCommit=true then there is no sharing.

  2. Is there way to unshare documents – I mean when transferring a
     segment to some other machine can I transfer its documents from
     the other segment that holds them to it?

Use writer.addIndexesNoOptimize (in general, that is the safest way to merge indices, instead of trying to build your own segments_N file).

  3. As I see in the source code in SVN of Lucene 2.3. there is class
     LogByteSizeMergePolicy that allows controlling the maximal size of
     segment that could be merged. Here I have two questions:
3.1. Can I control not only the max size of segments that will be merged, but also the max size (or approximate max size) of segments that would occur after merging?

Not really, since it's hard to predict the size the segment will be after merging. It only limits the max size of a segment that may be merged. You could roughly guess the final size of the segments (say sum of all byte sizes, proportionally reduced based on pending deletes) and put that into your own MergePolicy?
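
As a rough illustration of that guess (sum of byte sizes, each reduced in proportion to its pending deletes), here is a minimal sketch. It is plain Java and does not use the Lucene API: SegmentStats and estimateMergedBytes are hypothetical names made up for this example, and the assumption that deleted docs account for a proportional share of a segment's bytes is only an approximation.

```java
import java.util.List;

public class MergeSizeEstimate {

    /** Hypothetical per-segment numbers; this is not a Lucene class. */
    public static final class SegmentStats {
        final long sizeInBytes;
        final int docCount;
        final int deletedDocs;

        public SegmentStats(long sizeInBytes, int docCount, int deletedDocs) {
            this.sizeInBytes = sizeInBytes;
            this.docCount = docCount;
            this.deletedDocs = deletedDocs;
        }
    }

    /**
     * Rough guess at the size of the segment produced by merging the given
     * segments: each segment contributes its byte size scaled by the
     * fraction of its docs that are not deleted.
     */
    public static long estimateMergedBytes(List<SegmentStats> segments) {
        long total = 0;
        for (SegmentStats s : segments) {
            if (s.docCount == 0) {
                continue; // an empty segment contributes nothing
            }
            double liveFraction = (double) (s.docCount - s.deletedDocs) / s.docCount;
            total += Math.round(s.sizeInBytes * liveFraction);
        }
        return total;
    }
}
```

A custom merge policy could compare such an estimate against a target size before choosing a merge; how the real per-segment numbers are obtained is left out here.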

3.2. Can I somehow control the maximal size of segment at all (or may be its approximate maximal size – I mean to stop adding documents to a segment after it reaches some size)?

Lucene never adds docs to a segment, except by merging. So by preventing a segment > size X from being merged (using LogByteSizeMergePolicy), I think that's the best you can do.
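
To make the idea concrete, here is a small sketch of the size cap on its own: segments above size X are simply left out of the merge candidates, so they never grow further. This is plain Java, not the LogByteSizeMergePolicy code; the class and method names are hypothetical, and whether a segment of exactly size X counts as mergeable is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeCappedSelection {

    /**
     * Returns the sizes of the segments that may still take part in a merge.
     * Segments larger than maxMergeBytes are skipped and stay as they are.
     */
    public static List<Long> eligibleForMerge(List<Long> segmentSizes, long maxMergeBytes) {
        List<Long> eligible = new ArrayList<Long>();
        for (long size : segmentSizes) {
            if (size <= maxMergeBytes) {
                eligible.add(size);
            }
        }
        return eligible;
    }
}
```

Note that this caps only the inputs to a merge: several segments just under X can still merge into one that is well over X.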

3.3. Can I somehow control the maximal size of a segment and all other segments which documents are shared in it?


Not really, but your own merge policy could roughly guess...

Mike