bq: I guess the alternative would be to occasionally roll the dice and decide to merge that kind of segment.
That's what I was getting at with the "autoCompact" idea, in a more deterministic manner.

On Mon, Aug 28, 2017 at 1:32 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> That makes sense.
>
> I guess the alternative would be to occasionally roll the dice and decide
> to merge that kind of segment.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)

On Aug 28, 2017, at 1:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> I don't think jitter would help. As long as a segment has > 50% of the max
> segment size in "live" docs, it's forever ineligible for merging (outside
> of optimize or expungeDeletes commands). So the "zone" is anything over 50%.
>
> Or I missed your point.
>
> Erick

On Mon, Aug 28, 2017 at 12:50 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> If this happens in a precise zone, how about adding some random jitter to
> the threshold? That tends to get this kind of lock-up unstuck.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)

On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> And one more thought (not very well thought out).
>
> A parameter on TMP (or whatever) that did <3>, something like:
>
> - a parameter <autoCompactTime>
> - a parameter <autoCompactPct>
> - On startup, TMP takes the current timestamp.
> - (*) Every minute (or whatever) it checks the current timestamp, and if
>   <autoCompactTime> is between the last check time and now, do <2>.
> - Set the last-checked time to the value from (*) above.
>
> Taking the current timestamp on startup would keep from kicking off the
> compaction immediately, so we wouldn't need to keep stateful information
> across restarts and wouldn't go into a compact cycle on startup.
>
> Erick

On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> I've been thinking about this a little more.
> Since this is an outlier, I'm loath to change the core TMP merge selection
> process. Say the max segment size is 5G. You'd be doing an awful lot of I/O
> to merge a segment with 4.75G of "live" docs with one with 0.25G. Plus that
> doesn't really allow users who issue the tempting "optimize" command to
> recover; that one huge segment can hang around for a _very_ long time,
> accumulating lots of deleted docs. Same with expungeDeletes.
>
> I can think of several approaches:
>
> 1> Despite my comment above, a flag that says something like "if a segment
> has > X% deleted docs, merge it with a smaller segment anyway, respecting
> the max segment size. I know, I know, this will affect indexing throughput;
> do it anyway".
>
> 2> A special op (or perhaps a flag on expungeDeletes) that would behave
> like <1>, but on-demand rather than as part of standard merging.
>
> In both of these cases, if a segment had > X% deleted docs but the live-doc
> size for that segment was > the max seg size, rewrite it into a single new
> segment, removing deleted docs.
>
> 3> Some way to do the above on a schedule. My notion is something like a
> maintenance window at 1:00 AM. You'd still have a live collection, but
> (presumably) a way to purge the day's accumulation of deleted documents
> during off hours.
>
> 4> ???
>
> I probably like <2> best so far. I don't see this condition in the wild
> very often; it usually occurs during heavy re-indexing operations, and
> often after an optimize or expungeDeletes has happened. <1> could get
> horribly pathological if the threshold was 1% or something.
>
> WDYT?

On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Thanks, Mike:
>
> bq: Or are you saying that each segment's 20% of not-deleted docs is still
> greater than 1/2 of the max segment size, and so TMP considers them
> ineligible?
>
> Exactly.
>
> Hadn't seen the blog, thanks for that. Added to my list of things to refer to.
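The eligibility rule the thread keeps circling back to reduces to a one-line predicate. A minimal sketch, assuming the behavior described in this discussion (this is not Lucene's actual TieredMergePolicy code, and the names are hypothetical):

```python
# Sketch of the merge-eligibility rule described in this thread (an
# assumption from the discussion, not Lucene source): a segment whose
# live-document size exceeds 50% of maxMergedSegmentMB is never picked
# for normal merging.

MAX_MERGED_SEGMENT_MB = 5 * 1024  # the default 5G max discussed above

def eligible_for_merge(live_mb, max_mb=MAX_MERGED_SEGMENT_MB):
    # Only segments at or below half the max size are merge candidates.
    return live_mb <= max_mb / 2

# A max-sized segment that is "only" 40% deleted still has 3G live: stuck.
print(eligible_for_merge(3 * 1024))  # False
# Once deletions push live data under 2.5G, it becomes eligible again.
print(eligible_for_merge(2 * 1024))  # True
```

This is the "zone" mentioned above: anything over 50% live data can only shed its deletions via optimize or expungeDeletes.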
> The problem we're seeing is that "in the wild" there are cases where people
> can now get satisfactory performance from huge numbers of documents, as in
> close to 2B (there was a question on the users' list about that recently).
> So allowing up to 60% deleted documents is dangerous in that situation.
>
> And the situation is exacerbated by optimizing (I know, "don't do that").
>
> Ah, well, the joys of people using this open source thing and pushing its
> limits.
>
> Thanks again,
> Erick

On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> Hi Erick,
>
> Some questions/answers below:
>
> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> Particularly interested if Mr. McCandless has any opinions here.
>>
>> I admit it took some work, but I can create an index that never merges and
>> is 80% deleted documents using TieredMergePolicy.
>>
>> I'm trying to understand how indexes "in the wild" can have > 30% deleted
>> documents. I think the root issue here is that TieredMergePolicy doesn't
>> consider for merging any segment with > 50% of maxMergedSegmentMB in
>> non-deleted documents.
>>
>> Let's say I have segments at the default 5G max. For the sake of argument,
>> it takes exactly 5,000,000 identically-sized documents to fill the segment
>> to exactly 5G.
>>
>> IIUC, as long as the segment has more than 2,500,000 documents in it,
>> it'll never be eligible for merging.
>
> That's right.
>
>> The only way to force deleted docs to be purged is to expungeDeletes or
>> optimize, neither of which is recommended.
>
> +1
>
>> The condition I created was highly artificial but illustrative:
>> - I set my max segment size to 20M.
>> - Through experimentation I found that each segment would hold roughly
>>   160K synthetic docs.
>> - I set my ramBuffer to 1G.
>> - Then I'd index 500K docs, then delete 400K of them, and commit.
>>   This produces a single segment occupying (roughly) 80M of disk space,
>>   15M or so of it "live" documents, the rest deleted.
>> - Rinse, repeat with a disjoint set of doc IDs.
>>
>> The number of segments continues to grow forever, each one consisting of
>> 80% deleted documents.
>
> But wouldn't TMP at some point merge these segments? Or are you saying that
> each segment's 20% of not-deleted docs is still greater than 1/2 of the max
> segment size, and so TMP considers them ineligible?
>
> This is indeed a rather pathological case, and you're right, TMP would
> never merge them (if my logic above is right). Maybe we could tweak TMP for
> situations like this, though I'm not sure they happen in practice. Normally
> the max segment size is quite a bit larger than the initially flushed
> segment sizes.
>
>> This artificial situation just allowed me to see how the segments merged.
>> Without such artificial constraints, I suspect the limit for deleted
>> documents would be capped at 50% theoretically, and in practice less than
>> that, although I have seen 35% or so deleted documents in the wild.
>
> Yeah, I think so too. I wrote this blog post about deletions:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>
> It has a fun chart showing how the percentage of deleted docs bounces around.
>
>> So at the end of the day I have a couple of questions:
>>
>> 1> Is my understanding close to correct? This is really the first time
>> I've had to dive into the guts of merging.
>
> Yes!
>
>> 2> Is there a way I've missed to slim down an index other than
>> expungeDeletes or optimize/forceMerge?
>
> No.
>
>> It seems to me like eventually, with large indexes, every segment that is
>> the max size allowed is going to have to go over 50% deletes before being
>> merged, and there will have to be at least two of them. I don't see a
>> clean way to fix this; any algorithm would likely be far too expensive to
>> be part of regular merging.
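An on-demand compaction pass like the <2> option floated earlier in the thread might select segments roughly as follows. Everything here is hypothetical (names, thresholds, and packing strategy are sketched from the thread's description, not a real Lucene API):

```python
# Hypothetical selection logic for an on-demand compaction pass
# (the <2> option above): pick segments over a deleted-percentage
# threshold, rewrite oversized ones singly to drop their deletions,
# and pack the rest into groups up to the max segment size.
# A sketch from the thread's description, not Lucene code.

def pick_compactions(segments, deleted_pct, max_mb):
    # segments: list of (live_mb, deleted_mb) tuples
    groups, current, current_live = [], [], 0.0
    for i, (live, dead) in enumerate(segments):
        if 100.0 * dead / (live + dead) <= deleted_pct:
            continue  # under the threshold: leave it alone
        if live > max_mb / 2:
            groups.append([i])  # too big to pair up: rewrite it alone
        elif current_live + live > max_mb:
            groups.append(current)
            current, current_live = [i], live
        else:
            current.append(i)
            current_live += live
    if current:
        groups.append(current)
    return groups

# Two tiny 80%-deleted segments pack together; the max-sized segment
# at 55% deleted is rewritten by itself; the healthy one is skipped.
segs = [(15, 65), (15, 65), (4500, 5500), (100, 5)]
print(pick_compactions(segs, deleted_pct=50, max_mb=5120))  # [[2], [0, 1]]
```

Because it only runs on demand, the extra I/O is paid during a maintenance window rather than on every merge decision.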
>> I suppose we could merge segments of different sizes if the combined size
>> was < max segment size. On a quick glance it doesn't seem like the log
>> merge policies address this kind of case either, but I haven't dug into
>> them much.
>
> TMP should be able to merge one max-sized segment (that has eked just over
> 50% deleted docs) with smaller segments. It would not prefer this merge,
> since merging substantially different segment sizes performs poorly
> compared with merging equally sized segments, but it does have a bias for
> removing deleted docs that would offset that.
>
>> Thanks!
>
> You're welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
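The pathological cycle described in the thread (index 500K, delete 400K, commit, repeat) can be simulated with toy numbers. This sketch assumes the 50%-of-max-size eligibility rule as characterized in the discussion, not actual Lucene behavior:

```python
# Toy simulation of the runaway case from this thread, assuming the
# 50%-of-max eligibility rule discussed above (not Lucene source).
# Each cycle flushes one segment of ~80M total, ~15M of it live docs;
# with a 20M max segment size the eligibility threshold is 10M, so
# every segment stays ineligible and accumulates forever at ~80% deleted.

def run_cycles(n_cycles, live_mb=15.0, total_mb=80.0, max_segment_mb=20.0):
    stuck = []
    for _ in range(n_cycles):
        if live_mb > max_segment_mb / 2:  # 15M live > 10M threshold
            stuck.append((live_mb, total_mb - live_mb))
    deleted = sum(d for _, d in stuck)
    total = sum(l + d for l, d in stuck)
    return len(stuck), deleted / total  # segment count, deleted fraction

count, frac = run_cycles(100)
print(count, frac)  # 100 stuck segments, 0.8125 deleted, and growing
```

The same arithmetic explains the default case: with a 5G max, a segment needs to fall below 2.5G of live data, i.e. exceed 50% deletions, before normal merging can ever reclaim it.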