LadyForest commented on PR #119:
URL:
https://github.com/apache/flink-table-store/pull/119#issuecomment-1147245509
As discussed offline, the implementation has been modified as follows.
* For non-rescale bucket compaction, we don't perform a scan at the planning
phase. Instead, we attach a flag to the partition spec to indicate that it is
an ordinary, manually triggered compaction.
* Introduce a new compaction strategy, which "deep cleans" the data layout.
The current `UniversalCompaction` operates on `LevelSortedRun` and focuses on
reducing write amplification. For manually triggered compaction, however, we
want to eliminate all overlapping key ranges, so that after the compaction a
scan can simply concatenate the files instead of merging them. We also want to
compact small files on a best-effort basis. The proposed strategy therefore
works as follows.
1. Use the `IntervalPartition` algorithm to partition all scanned manifests
into sections (`List<List<SortedRun>>`). The key ranges of different sections
do not overlap, while the sorted runs that fall into the same section have
overlapping key ranges.
2. Consequently, selecting the sections that contain more than one sorted run
finds all overlapping files. In addition, a section that contains only one
sorted run can still be picked if it holds more than two small data files.
3. **IMPORTANT**: compaction should be performed within each section, never
across different sections. This strategy may therefore pick a list of
`CompactUnit`s for a single bucket, which differs from `UniversalCompaction`.
* Introduce a new `PrecommittingSinkWriter` implementation dedicated to
compaction tasks. This writer scans and selects the partitions and buckets
that belong to the current sub-task id, and then creates a per-bucket compact
writer to submit the compaction. Since no data is shuffled between source and
sink, all compaction is performed when `SinkWriterOperator#endInput` is
invoked.
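
To make steps 2 and 3 of the proposed strategy concrete, here is a minimal
sketch of the picking logic only. It assumes the sections have already been
produced by the partitioning step. The classes `DataFile`, `SortedRun` and
`CompactUnit` below are simplified, hypothetical stand-ins and not the actual
classes in this PR, and the `smallFileBytes` threshold and the `pick` and
`demoSections` helpers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names and threshold are assumptions. */
public class SectionPickSketch {

    /** Hypothetical stand-in for a data file; only the size matters here. */
    static final class DataFile {
        final String name;
        final long sizeBytes;

        DataFile(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Hypothetical stand-in for a sorted run (files with disjoint key ranges). */
    static final class SortedRun {
        final List<DataFile> files;

        SortedRun(DataFile... files) {
            this.files = Arrays.asList(files);
        }
    }

    /** One unit of work: the files of exactly one section, never more. */
    static final class CompactUnit {
        final List<DataFile> files = new ArrayList<>();
    }

    /** Picks one unit per qualifying section, so a bucket may yield several units. */
    static List<CompactUnit> pick(List<List<SortedRun>> sections, long smallFileBytes) {
        List<CompactUnit> units = new ArrayList<>();
        for (List<SortedRun> section : sections) {
            // More than one sorted run in a section means overlapping key ranges.
            boolean overlapping = section.size() > 1;
            int smallFiles = 0;
            for (SortedRun run : section) {
                for (DataFile file : run.files) {
                    if (file.sizeBytes < smallFileBytes) {
                        smallFiles++;
                    }
                }
            }
            // A single-run section qualifies only with more than two small files.
            if (overlapping || smallFiles > 2) {
                CompactUnit unit = new CompactUnit();
                for (SortedRun run : section) {
                    unit.files.addAll(run.files);
                }
                units.add(unit);
            }
        }
        return units;
    }

    /** Three sections: overlapping runs, one run of small files, one large file. */
    static List<List<SortedRun>> demoSections() {
        List<List<SortedRun>> sections = new ArrayList<>();
        sections.add(Arrays.asList(
                new SortedRun(new DataFile("a1", 100)),
                new SortedRun(new DataFile("a2", 100))));
        sections.add(Arrays.asList(
                new SortedRun(new DataFile("b1", 1), new DataFile("b2", 1), new DataFile("b3", 1))));
        sections.add(Arrays.asList(new SortedRun(new DataFile("c1", 100))));
        return sections;
    }

    public static void main(String[] args) {
        for (CompactUnit unit : pick(demoSections(), 10L)) {
            StringBuilder names = new StringBuilder();
            for (DataFile file : unit.files) {
                names.append(file.name).append(' ');
            }
            System.out.println("compact unit: " + names.toString().trim());
        }
    }
}
```

In this sketch the third demo section is left untouched because it has a
single sorted run and no small files, while the first two sections each
become their own unit rather than being merged into one.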