[ 
https://issues.apache.org/jira/browse/FLINK-21321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284960#comment-17284960
 ] 

Joey Pereira commented on FLINK-21321:
--------------------------------------

It turns out the feature has not been marked as experimental since {{5.18.0}}! 
I thought it still was.

Here's a full list of changes related to {{deleteRange}} I pulled from 
[https://github.com/facebook/rocksdb/blob/master/HISTORY.md].

—
h2. Unreleased
h3. Bug Fixes
 * Since 6.15.0, {{TransactionDB}} returns error {{Status}} from calls to 
{{DeleteRange()}} and calls to {{Write()}} where the {{WriteBatch}} contains a 
range deletion. Previously such operations may have succeeded while not 
providing the expected transactional guarantees. There are certain cases where 
range deletion can still be used on such DBs; see the API doc on 
{{TransactionDB::DeleteRange()}} for details.
 * {{OptimisticTransactionDB}} now returns error {{Status}} from calls to 
{{DeleteRange()}} and calls to {{Write()}} where the {{WriteBatch}} contains a 
range deletion. Previously such operations may have succeeded while not 
providing the expected transactional guarantees.

h2. 6.11 (6/12/2020)
h3. Public API Change
 * {{DeleteRange}} now returns {{Status::InvalidArgument}} if the range's end key 
comes before its start key according to the user comparator. Previously the 
behavior was undefined.

h2. 6.8.0 (02/24/2020)
h3. Bug Fixes
 * {{WriteBatchWithIndex::DeleteRange}} returns {{Status::NotSupported}}. 
Previously it returned success even though reads on the batch did not account 
for range tombstones. The corresponding language bindings now cannot be used. 
In C, that includes {{rocksdb_writebatch_wi_delete_range}}, 
{{rocksdb_writebatch_wi_delete_range_cf}}, 
{{rocksdb_writebatch_wi_delete_rangev}}, and 
{{rocksdb_writebatch_wi_delete_rangev_cf}}. In Java, that includes 
{{WriteBatchWithIndex::deleteRange}}.

h2. 6.6.1 (01/02/2020)
h3. Bug Fixes
 * Fix a bug in which a snapshot read through an iterator could be affected by 
a {{DeleteRange}} after the snapshot (#6062).

h2. 5.18.0 (11/30/2018)
h3. New Features
 * Improved {{DeleteRange}} to prevent read performance degradation. The 
feature is no longer marked as experimental.

h2. 5.10.0 (12/11/2017)
h3. Bug Fixes
 * Fix possible corruption to LSM structure when {{DeleteFilesInRange()}} 
deletes a subset of files spanned by a {{DeleteRange()}} marker.

h2. 5.9.0 (11/1/2017)
h3. Bug Fixes
 * Fix possible metadata corruption in databases using {{DeleteRange()}}.

h2. 5.7.0 (07/13/2017)
h3. Bug Fixes
 * Fix discarding empty compaction output files when {{DeleteRange()}} is used 
together with subcompactions.

h2. 5.0.0 (11/17/2016)
h3. Public API Change
 * Introduce {{DB::DeleteRange}} for optimized deletion of large ranges of 
contiguous keys.
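
As a quick illustration of the API itself, here's a minimal sketch using the plain 
RocksJava bindings (not Flink code; the path and keys are made up). A single call 
drops every key in the half-open range [beginKey, endKey) with one range tombstone 
instead of iterating and deleting point-wise:

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class DeleteRangeSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/delete-range-demo")) {
            db.put("key-001".getBytes(), "v1".getBytes());
            db.put("key-002".getBytes(), "v2".getBytes());
            db.put("key-003".getBytes(), "v3".getBytes());

            // One range tombstone covering [beginKey, endKey): removes
            // "key-001" and "key-002" but leaves "key-003" in place.
            db.deleteRange("key-001".getBytes(), "key-003".getBytes());
        }
    }
}
{code}

RocksJava also has {{deleteRange}} overloads taking a {{ColumnFamilyHandle}} and 
{{WriteOptions}}, which is what a per-column-family clipping pass would use.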

> Change RocksDB incremental checkpoint re-scaling to use deleteRange
> -------------------------------------------------------------------
>
>                 Key: FLINK-21321
>                 URL: https://issues.apache.org/jira/browse/FLINK-21321
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Joey Pereira
>            Priority: Minor
>              Labels: pull-request-available
>
> In FLINK-8790, it was suggested to use RocksDB's {{deleteRange}} API to more 
> efficiently clip the databases for the desired target group.
> During the PR for that ticket, 
> [#5582|https://github.com/apache/flink/pull/5582], the change did not end up 
> using the {{deleteRange}} method, as it was an experimental feature in 
> RocksDB.
> At this point {{deleteRange}} is in a far less experimental state, but I 
> believe it is still formally "experimental". It is heavily used by many others 
> like CockroachDB and TiKV, and they have teased out several bugs in complex 
> interactions over the years.
> For certain re-scaling situations where restores trigger 
> {{restoreWithScaling}} and the DB clipping logic, this would likely reduce an 
> O(n) operation (n = state size in records) to O(1). For large-state apps, this 
> could represent a non-trivial amount of time spent on re-scaling. In the case 
> of my workplace, we have an operator with hundreds of billions of records in 
> state, and re-scaling was taking a long time (>>30min, though it has been a 
> while since we last did it).
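
To make the clipping idea above concrete, here is a rough, hypothetical sketch of 
what replacing the per-record scan with range tombstones could look like. The 
{{keyGroupPrefix}} helper, the fixed 2-byte key-group prefix, and the 
single-column-family signature are all assumptions for illustration, not the 
actual Flink restore code:

{code:java}
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class KeyGroupClipSketch {

    // Hypothetical: serialize a key-group id as a big-endian 2-byte key prefix.
    static byte[] keyGroupPrefix(int keyGroup) {
        return new byte[] {(byte) (keyGroup >>> 8), (byte) keyGroup};
    }

    /**
     * Drop all key groups outside [targetStart, targetEnd] from a restored DB
     * with at most two range tombstones, instead of scanning and point-deleting
     * every out-of-range record. Assumes key groups [dbStart, dbEnd] are present
     * and that dbEnd + 1 still fits in the 2-byte prefix.
     */
    static void clipToKeyGroupRange(RocksDB db, ColumnFamilyHandle cf,
                                    int dbStart, int dbEnd,
                                    int targetStart, int targetEnd) throws RocksDBException {
        if (dbStart < targetStart) {
            // Delete [dbStart, targetStart): everything below the target range.
            db.deleteRange(cf, keyGroupPrefix(dbStart), keyGroupPrefix(targetStart));
        }
        if (targetEnd < dbEnd) {
            // Delete [targetEnd + 1, dbEnd + 1): everything above the target range.
            db.deleteRange(cf, keyGroupPrefix(targetEnd + 1), keyGroupPrefix(dbEnd + 1));
        }
    }
}
{code}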



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
