And one more thought (not very well thought out).

A parameter on TMP (or whatever) that did <3>, something like (rough
sketch below):
- a parameter <autoCompactTime>
- a parameter <autoCompactPct>
- on startup TMP takes the current timestamp
- (*) every minute (or whatever) it checks the current timestamp and, if
<autoCompactTime> falls between the last check time and now, does <2>
- then sets the last checked time to the value taken in (*) above

Taking the current timestamp at startup would keep it from kicking off
a compaction right away, so we wouldn't need to keep stateful
information across restarts and wouldn't go into a compaction cycle on
every startup.
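
A very rough sketch of the check (nothing here exists today:
autoCompactTime/autoCompactPct are made-up parameter names and compact()
is just a placeholder for <2>; it also ignores the midnight wrap-around):

    import java.time.LocalTime;

    // Sketch only: none of these parameters exist on TMP (or anywhere else) today.
    class AutoCompactChecker {
      private final LocalTime autoCompactTime;  // e.g. 01:00, the maintenance window
      private final double autoCompactPct;      // e.g. 20.0, min % deleted docs worth purging
      private LocalTime lastCheck;

      AutoCompactChecker(LocalTime autoCompactTime, double autoCompactPct) {
        this.autoCompactTime = autoCompactTime;
        this.autoCompactPct = autoCompactPct;
        this.lastCheck = LocalTime.now();       // taken at startup, so no compact cycle on restart
      }

      // Called every minute (or whatever) by some background thread.
      void maybeCompact() {
        LocalTime now = LocalTime.now();
        if (!autoCompactTime.isBefore(lastCheck) && !autoCompactTime.isAfter(now)) {
          compact(autoCompactPct);              // do <2>, whatever form that ends up taking
        }
        lastCheck = now;
      }

      void compact(double pctDeleted) { /* placeholder for <2> */ }
    }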

Erick

On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
<erickerick...@gmail.com> wrote:
> I've been thinking about this a little more. Since this is an outlier,
> I'm loath to change the core TMP merge selection process. Say the max
> segment size is 5G. You'd be doing an awful lot of I/O to merge a
> segment with 4.75G "live" docs with one with 0.25G. Plus that doesn't
> really allow users who issue the tempting "optimize" command to
> recover; that one huge segment can hang around for a _very_ long time,
> accumulating lots of deleted docs. Same with expungeDeletes.
>
> I can think of several approaches:
>
> 1> despite my comment above, a flag that says something like "if a
> segment has > X% deleted docs, merge it with a smaller segment anyway,
> respecting the max segment size. I know, I know, this will affect
> indexing throughput; do it anyway".
>
> 2> A special op (or perhaps a flag on expungeDeletes) that would
> behave like <1> but on-demand rather than part of standard merging.
>
> In both of these cases, if a segment had > X% deleted docs but the
> live doc size for that segment was > the max seg size, rewrite it into
> a single new segment removing deleted docs.
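>
> Roughly what I have in mind for <1>/<2>, as a sketch rather than a
> patch (none of these names are real TMP internals; X is the user
> threshold and maxSegBytes the configured max segment size):
>
>     // Sketch only: Seg and the two helpers are stand-ins, not Lucene classes.
>     class PurgeSketch {
>       static class Seg { long totalBytes; long liveBytes; }
>
>       void purgeDeletes(java.util.List<Seg> segments, double x, long maxSegBytes) {
>         for (Seg seg : segments) {
>           double pctDeleted = 100.0 * (seg.totalBytes - seg.liveBytes) / seg.totalBytes;
>           if (pctDeleted <= x) continue;         // below threshold: leave it to normal merging
>           if (seg.liveBytes > maxSegBytes) {
>             rewriteSingleton(seg);               // live docs alone exceed the max: rewrite alone, dropping deletes
>           } else {
>             mergeWithSmaller(seg, maxSegBytes);  // pair with smaller segments, still respecting the max
>           }
>         }
>       }
>
>       void rewriteSingleton(Seg seg) { /* placeholder: single-segment rewrite */ }
>       void mergeWithSmaller(Seg seg, long maxSegBytes) { /* placeholder: merge selection */ }
>     }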
>
> 3> some way to do the above on a schedule. My notion is something like
> a maintenance window at 1:00 AM. You'd still have a live collection,
> but would (presumably) gain a way to purge the day's accumulation of
> deleted documents during off hours.
>
> 4> ???
>
> I probably like <2> best so far. I don't see this condition in the
> wild very often; it usually occurs during heavy re-indexing operations,
> and often after an optimize or expungeDeletes has happened. <1> could
> get horribly pathological if the threshold were 1% or something.
>
> WDYT?
>
>
> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com> 
> wrote:
>> Thanks Mike:
>>
>> bq: Or are you saying that each segment's 20% of not-deleted docs is
>> still greater than 1/2 of the max segment size, and so TMP considers
>> them ineligible?
>>
>> Exactly.
>>
>> Hadn't seen the blog, thanks for that. Added to my list of things to refer 
>> to.
>>
>> The problem we're seeing is that "in the wild" there are cases where
>> people can now get satisfactory performance from huge numbers of
>> documents, as in close to 2B (there was a question on the users' list
>> about that recently). So allowing up to 60% deleted documents is
>> dangerous in that situation.
>>
>> And the situation is exacerbated by optimizing (I know, "don't do that").
>>
>> Ah, well, the joys of people using this open source thing and pushing
>> its limits.
>>
>> Thanks again,
>> Erick
>>
>> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>> Hi Erick,
>>>
>>> Some questions/answers below:
>>>
>>> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>>
>>>> Particularly interested if Mr. McCandless has any opinions here.
>>>>
>>>> I admit it took some work, but I can create an index that never merges
>>>> and is 80% deleted documents using TieredMergePolicy.
>>>>
>>>> I'm trying to understand how indexes "in the wild" can have > 30%
>>>> deleted documents. I think the root issue here is that
>>>> TieredMergePolicy doesn't consider for merging any segment whose
>>>> non-deleted documents amount to more than 50% of maxMergedSegmentMB.
>>>>
>>>> Let's say I have segments at the default 5G max. For the sake of
>>>> argument, it takes exactly 5,000,000 identically-sized documents to
>>>> fill the segment to exactly 5G.
>>>>
>>>> IIUC, as long as the segment has more than 2,500,000 live documents in
>>>> it, it'll never be eligible for merging.
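>>>>
>>>> Spelling out that arithmetic (just my reading of the rule, not TMP's
>>>> actual code):
>>>>
>>>>     long maxMergedSegmentBytes = 5L * 1024 * 1024 * 1024;       // the 5G default
>>>>     long bytesPerDoc = maxMergedSegmentBytes / 5_000_000;       // identically-sized docs in this example
>>>>     long liveDocs = 2_600_000;                                  // 2.4M of the 5M docs deleted
>>>>     long liveBytes = liveDocs * bytesPerDoc;
>>>>     // Live bytes are still more than half the max segment size, so never eligible.
>>>>     boolean eligible = liveBytes <= maxMergedSegmentBytes / 2;  // false here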
>>>
>>>
>>> That's right.
>>>
>>>>
>>>> The only way to force deleted
>>>> docs to be purged is to expungeDeletes or optimize, neither of which
>>>> is recommended.
>>>
>>>
>>> +1
>>>
>>>> The condition I created was highly artificial but illustrative:
>>>> - I set my max segment size to 20M
>>>> - Through experimentation I found that each segment would hold roughly
>>>> 160K synthetic docs.
>>>> - I set my ramBuffer to 1G.
>>>> - Then I'd index 500K docs, then delete 400K of them, and commit. This
>>>> produces a single segment occupying (roughly) 80M of disk space, 15M
>>>> or so of it "live" documents, the rest deleted.
>>>> - rinse, repeat with a disjoint set of doc IDs.
>>>>
>>>> The number of segments continues to grow forever, each one consisting
>>>> of 80% deleted documents.
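>>>>
>>>> The setup was roughly this (reconstructed, not the exact code I ran;
>>>> field names and constants are illustrative):
>>>>
>>>>     import java.nio.file.Paths;
>>>>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>     import org.apache.lucene.document.*;
>>>>     import org.apache.lucene.index.*;
>>>>     import org.apache.lucene.store.FSDirectory;
>>>>
>>>>     public class DeleteHeavyIndex {
>>>>       public static void main(String[] args) throws Exception {
>>>>         TieredMergePolicy tmp = new TieredMergePolicy();
>>>>         tmp.setMaxMergedSegmentMB(20);                 // artificially small max segment
>>>>
>>>>         IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
>>>>         iwc.setMergePolicy(tmp);
>>>>         iwc.setRAMBufferSizeMB(1024);                  // each cycle flushes as a single segment
>>>>
>>>>         try (IndexWriter w = new IndexWriter(FSDirectory.open(Paths.get("/tmp/deltest")), iwc)) {
>>>>           for (int cycle = 0; cycle < 100; cycle++) {  // each cycle leaves one ~80M, 80%-deleted segment
>>>>             int base = cycle * 500_000;                // disjoint doc IDs per cycle
>>>>             for (int i = 0; i < 500_000; i++) {
>>>>               Document doc = new Document();
>>>>               doc.add(new StringField("id", Integer.toString(base + i), Field.Store.NO));
>>>>               doc.add(new TextField("body", "synthetic text " + i, Field.Store.NO));
>>>>               w.addDocument(doc);
>>>>             }
>>>>             for (int i = 0; i < 400_000; i++) {        // delete 400K of the 500K just added
>>>>               w.deleteDocuments(new Term("id", Integer.toString(base + i)));
>>>>             }
>>>>             w.commit();
>>>>           }
>>>>         }
>>>>       }
>>>>     }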
>>>
>>>
>>> But wouldn't TMP at some point merge these segments?  Or are you saying that
>>> each segment's 20% of not-deleted docs is still greater than 1/2 of the max
>>> segment size, and so TMP considers them ineligible?
>>>
>>> This is indeed a rather pathological case, and you're right TMP would never
>>> merge them (if my logic above is right).  Maybe we could tweak TMP for
>>> situations like this, though I'm not sure they happen in practice.  Normally
>>> the max segment size is quite a bit larger than the initially flushed
>>> segment sizes.
>>>
>>>>
>>>> This artificial situation just allowed me to see how the segments
>>>> merged. Without such artificial constraints I suspect the proportion of
>>>> deleted documents would be capped at 50% in theory, and in practice would
>>>> be less than that, although I have seen 35% or so deleted documents in
>>>> the wild.
>>>
>>>
>>> Yeah I think so too.  I wrote this blog post about deletions:
>>> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>>>
>>> It has a fun chart showing how the percentage of deleted docs bounces around.
>>>
>>>>
>>>> So at the end of the day I have a couple of questions:
>>>>
>>>> 1> Is my understanding close to correct? This is really the first time
>>>> I've had to dive into the guts of merging.
>>>
>>>
>>> Yes!
>>>
>>>>
>>>> 2> Is there a way I've missed to slim down an index other than
>>>> expungeDeletes or optimize/forceMerge?
>>>
>>>
>>> No.
>>>
>>>> It seems to me that eventually, with large indexes, every max-sized
>>>> segment is going to have to go over 50% deletes before being merged,
>>>> and there will have to be at least two such segments. I don't see a
>>>> clean way to fix this; any algorithm would likely be far too expensive
>>>> to be part of regular merging. I suppose we could merge segments of
>>>> different sizes if the combined size was < max segment size. At a quick
>>>> glance it doesn't seem like the log merge policies address this kind of
>>>> case either, but I haven't dug into them much.
>>>
>>>
>>> TMP should be able to merge one max-sized segment (that has eked just over
>>> 50% deleted docs) with smaller segments.  It would not prefer this merge,
>>> since merging substantially different segment sizes performs worse than
>>> merging equally sized segments, but it does have a bias toward reclaiming
>>> deleted docs that would offset that.
>>>
>>>>
>>>> Thanks!
>>>
>>>
>>> You're welcome!
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
