[
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214159#comment-16214159
]
Erick Erickson commented on LUCENE-7976:
----------------------------------------
Mike:
bq: The designer didn't think about this case
That's funny! If you only knew how many times "the designer" of some of _my_
code "didn't think about...." well, a lot of things....
bq: Erick, are these timestamp'd documents?
Some are, some aren't. Time-series data is certainly amenable to rolling over,
but I have clients with significantly different data sets that are not
timestamped and don't lend themselves to adding shards for new time periods.
bq: And 50% is the worst case...
True, but in situations where
> the index is in the 200G range, implying 40 segments or so by default
> random ones are replaced
it gets close enough to 50% for me to consider it a norm.
bq: disks are cheap and getting cheaper.
But space isn't. I _also_ have clients who simply cannot expand their capacity
due to space constraints. I know it sounds kind of weird in this age of AWS but
it's true. Some organizations require on-prem servers, either because of
corporate policy or because they deal with sensitive information.
bq: Users already must have a lot of free disk space to accommodate running
merges
Right, but that makes it _worse_. To store 1TB of "live" docs, I need an extra
TB just to hold the index if it has 50% deleted docs, plus enough free space
for ongoing merges. And aggregate indexes are rapidly approaching petabytes
(not per shard of course, but.....)
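To put rough numbers on that, here is a minimal sketch. It assumes a deleted
doc takes about as much disk as a live one (my assumption, nothing measured),
and the class and method names are made up for illustration, not Lucene API.
Merge headroom is left as a parameter because I'm not quoting a figure for it.

```java
/** Back-of-the-envelope disk sizing; names are illustrative, not Lucene API. */
final class IndexDiskEstimate {

    /**
     * On-disk index size needed to hold liveSize worth of live docs when
     * deletedFraction of the index is deleted docs.
     * Assumes deleted and live docs cost the same space per doc.
     */
    static double indexSize(double liveSize, double deletedFraction) {
        if (deletedFraction < 0.0 || deletedFraction >= 1.0) {
            throw new IllegalArgumentException("deletedFraction must be in [0, 1)");
        }
        return liveSize / (1.0 - deletedFraction);
    }

    /** Index size plus whatever free space is reserved for running merges. */
    static double requiredDisk(double liveSize, double deletedFraction, double mergeHeadroom) {
        return indexSize(liveSize, deletedFraction) + mergeHeadroom;
    }

    public static void main(String[] args) {
        double liveTb = 1.0;
        // 50% deleted is the case discussed above; 20% is shown for comparison.
        System.out.printf("50%% deleted: %.2f TB index%n", indexSize(liveTb, 0.50));
        System.out.printf("20%% deleted: %.2f TB index%n", indexSize(liveTb, 0.20));
    }
}
```

Under that assumption, 1TB of live docs at 50% deleted needs a 2TB index before
any merge headroom is counted, which is the "extra TB" above.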
This just looks to me like the natural evolution as Lucene gets applied to
ever-bigger data sets. When TMP was designed (hey, I was alive then) sharding
to deal with data sets we routinely deal with now was A Big Deal. Solr/Lucene
(OK, I'll admit ES too) have gotten much better at dealing with _much_ larger
data sets, so it's time to revisit some of the assumptions, and here we are.....
I'll also add that for lots of clients, "just add more disk space" is a fine
solution, one I recommend often. The engineering time wasted trying to work
around a problem that would be solved with $1,000 of new disks makes me tear my
hair out. And I'll add that I don't usually deal with clients that have tiny
little 1T aggregate indexes much, so my view is a bit skewed. That said,
today's edge case is tomorrow's norm.
And saying "tiny little 1T aggregate indexes" is, indeed, intended to be
ironic.....
> Add a parameter to TieredMergePolicy to merge segments that have more than X
> percent deleted documents
> ------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
>
> We're seeing situations "in the wild" where there are very large indexes (on
> disk) handled quite easily in a single Lucene index. This is particularly
> true as features like docValues move data into MMapDirectory space. The
> current TMP algorithm allows on the order of 50% deleted documents as per a
> dev list conversation with Mike McCandless (and his blog here:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate (think many
> TB), solutions like "you need to distribute your collection over more shards"
> become very costly. Additionally, the tempting "optimize" button exacerbates
> the issue since once you form, say, a 100G segment (by
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name,
> suggestions welcome) which would default to 100 (i.e. the same behavior we
> have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO
> > MATTER HOW LARGE. There are two cases:
> >> The segment has < 5G "live" docs. In that case it would be merged with
> >> smaller segments to bring the resulting segment up to 5G. If no smaller
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize).
> >> It would be rewritten into a single segment removing all deleted docs no
> >> matter how big it is to start. The 100G example above would be rewritten
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the
> default would be the same behavior we see now. As it stands now, though,
> there's no way to recover from an optimize/forceMerge except to re-index from
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the
> wild" with 10s of shards replicated 3 or more times. And that doesn't even
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A
> new merge policy is certainly an alternative.
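
The selection rule in the description can be sketched roughly as below. This
is only an illustration of the proposal as worded, not TieredMergePolicy code:
the class, enum and method names are hypothetical, segment sizes are plain
doubles in GB, and treating a segment with exactly 5G of live docs as
"rewrite alone" is my assumption since the description leaves that case open.
The last helper assumes a segment becomes mergeable today once its live size
falls to half the max segment size, which is what the 97.5G figure implies.

```java
/** Sketch of the proposed rule; hypothetical names, not Lucene's TieredMergePolicy. */
final class DeletedPctMergeSketch {

    enum Action { LEAVE_ALONE, MERGE_WITH_SMALLER, REWRITE_ALONE }

    /** Stand-in for a segment: total size on disk and percent of docs deleted. */
    static final class Segment {
        final double sizeGb;
        final double pctDeleted;

        Segment(double sizeGb, double pctDeleted) {
            this.sizeGb = sizeGb;
            this.pctDeleted = pctDeleted;
        }

        /** Size of the live docs, assuming size scales with doc count. */
        double liveGb() {
            return sizeGb * (1.0 - pctDeleted / 100.0);
        }
    }

    static Action classify(Segment s, double maxAllowedPctDeleted,
                           double maxSegmentGb, boolean smallerSegmentsExist) {
        // With the proposed default of 100 this never triggers: behavior is unchanged.
        if (s.pctDeleted <= maxAllowedPctDeleted) {
            return Action.LEAVE_ALONE;
        }
        // Case 1: live docs fit under the max segment size, so top the segment
        // up with smaller segments if there are any, else just rewrite it.
        if (s.liveGb() < maxSegmentGb) {
            return smallerSegmentsExist ? Action.MERGE_WITH_SMALLER : Action.REWRITE_ALONE;
        }
        // Case 2: oversized segment (e.g. from forceMerge); rewrite it by itself.
        return Action.REWRITE_ALONE;
    }

    /** GB that must be deleted before a segment is mergeable under today's behavior (assumed rule). */
    static double deletedGbBeforeEligibleToday(double sizeGb, double maxSegmentGb) {
        return Math.max(0.0, sizeGb - maxSegmentGb / 2.0);
    }

    public static void main(String[] args) {
        Segment forceMerged = new Segment(100.0, 21.0);
        System.out.println(classify(forceMerged, 20.0, 5.0, true));
        System.out.println(deletedGbBeforeEligibleToday(100.0, 5.0));
    }
}
```

Keeping the check as "more than X percent" matches the wording of the
proposal; whether the bound should be inclusive is a detail for the patch.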