Re: Quick Questions on Merging

Erick Erickson Thu, 03 Jan 2019 14:33:29 -0800

1> A segment is a miniature index that holds part of the total logical
index; each segment is complete in and of itself.
All the files with the same prefix comprise a single segment. I.e.
_0.fdt, _0.fdx, _0.tim... all make up a segment. See:
https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html.
Each extension holds different information about that segment.
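
As a rough illustration of the "same prefix = same segment" rule, here is a
minimal Python sketch that groups a directory listing by segment. It is not
Lucene code; the helper names (segment_name, group_by_segment) are made up for
this example, and it assumes that segment files start with "_" and that the
segment name ends at the first "." or at a second "_" (as in
_1_Lucene50_0.doc).

```python
from collections import defaultdict

def segment_name(filename):
    """Return the segment prefix of an index file name, or None if the
    file does not belong to a segment (e.g. segments_1, write.lock)."""
    if not filename.startswith("_"):
        return None
    # Assumption: the prefix ends at the first '.' or at a second '_',
    # whichever comes first.
    end = len(filename)
    for sep in ("_", "."):
        pos = filename.find(sep, 1)
        if pos != -1:
            end = min(end, pos)
    return filename[:end]

def group_by_segment(filenames):
    """Map each segment prefix to the list of files that make it up."""
    groups = defaultdict(list)
    for name in filenames:
        seg = segment_name(name)
        if seg is not None:
            groups[seg].append(name)
    return dict(groups)
```

Calling group_by_segment(os.listdir(index_dir)) on an index directory and
taking len() of the result gives a file-based estimate of the number of
segments, which is usually far smaller than the number of files.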

2> No. The segments_N file contains a list of the current segments as
of some commit point. In the absence of active indexing, segments_N
will contain all the segments in the index directory. There's a lot of
nuance here that I'm skipping about how segments come and go based on
background merging and the like, how an "index searcher" only "sees"
certain segments until a new searcher is opened and the like, but
that's kind of extraneous at this point.

3> Yes, kind of. Don't think of it as "files" though, think of it as
"segments". IOW, if segments 0, 1, 2, 3 are being merged into segment
4, then _0.fdt, _1.fdt, _2.fdt and _3.fdt will be merged into _4.fdt
and so on for all the different extensions. Once all the merging is
done and a new searcher is opened, _0.*, _1.*, _2.* and _3.* will be
deleted.

4> Pretty much. Again, think of it as segments rather than files
though. Here's Mike McCandless' excellent blog on the topic:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.
TieredMergePolicy (TMP) is the default (third graphic down IIRC).
Basically, your maxMergeAtOnce being set to 10 means that up to 10 roughly
same-sized segments will be merged into a new segment. The idea here
is that let's say maxMergeAtOnce is 3 ('cause it's easier to enumerate
than 10). Let's further say you have 3 segments, of sizes (in M) 1, 1,
100. It'd be extremely wasteful to rewrite that 100M segment into a
new segment just to add 2 more M, so TMP waits until there are three
small segments (sizes 1, 1, 1, 100) and merges the three similar-sized
segments into one, so you wind up with two segments of sizes 3 and 100.
When there are 3 3M segments, they're merged into a 9M segment and so
on. Incidentally, the default max segment size is 5G so at some point
you'll have segments that won't be merged unless they have a lot of
deleted docs.
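
The 1, 1, 100 example can be played out with a toy simulation. This is only a
sketch of the idea, not the real TieredMergePolicy algorithm: the function
names and the max_ratio "similar size" test are assumptions made up for
illustration, and the max segment size and deleted docs are ignored entirely.

```python
def merge_once(sizes, max_merge_at_once=3, max_ratio=2.0):
    """Merge the smallest run of similar-sized segments, if there is one.

    Returns the new sorted list of segment sizes, or the sorted input
    unchanged when no run of max_merge_at_once similar segments exists.
    """
    ordered = sorted(sizes)
    for start in range(len(ordered) - max_merge_at_once + 1):
        window = ordered[start:start + max_merge_at_once]
        # Hypothetical similarity test: largest is at most max_ratio
        # times the smallest segment in the window.
        if window[-1] <= max_ratio * window[0]:
            rest = ordered[:start] + ordered[start + max_merge_at_once:]
            return sorted(rest + [sum(window)])
    return ordered

def merge_until_stable(sizes, max_merge_at_once=3, max_ratio=2.0):
    """Apply merge_once repeatedly until nothing more can be merged."""
    current = sorted(sizes)
    while True:
        merged = merge_once(current, max_merge_at_once, max_ratio)
        if merged == current:
            return current
        current = merged
```

With these assumptions, merge_once([1, 1, 100]) leaves the segments alone,
while merge_once([1, 1, 1, 100]) produces segments of sizes 3 and 100, and
three 3M segments next to the 100M one collapse into a single 9M segment.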

I'm skipping a _lot_ here about how "like sized" segments are chosen.

All that said, by and large you should simply ignore this unless
you're trying to troubleshoot some kind of performance issue.

Best,
Erick

On Thu, Jan 3, 2019 at 1:58 PM John Wilson <[email protected]> wrote:
>
> Hi,
>
> I'm watching my index directory while indexing million documents. While my 
> indexer runs, I see a number of files with extensions like tip, doc, tim, 
> fdx, fdt, etc being created. The total number of these files goes up and down 
> during the run -- from as high as 1500 in the middle of the run to 290 when 
> the indexer completes. Finally, I see that an additional file segments_1 
> being created.
>
> My questions:
>
> What exactly is a segment?
> In my case, does it mean that I just have 1 segment since I have just one 
> segments_1 file? Or,
> Is it the case that files of the same type (extension) get merged together 
> into bigger files? For example, many fdt files being merged into one or 
> bigger fdt files?
> maxMergeAtOnce specifies the # of many segments at once to merge. In my case, 
> what does this mean? If I set it to 10, for example, does it mean that once 
> the # of files for a specific file type (e.g. fdt) reaches 10, it is combined 
> into a single fdt file?
>
> Thanks in advance!

