[GitHub] [iceberg] Reo-LEI edited a comment on pull request #3323: Flink: Incrementally rewrite data files in streaming.

GitBox Fri, 05 Nov 2021 12:46:32 -0700


Reo-LEI edited a comment on pull request #3323:
URL: https://github.com/apache/iceberg/pull/3323#issuecomment-962011180

I‘m very grateful to @jackye1995 for his excellent work. I think Jack
summarized this PR from the background, implementation, benefits very well in
this
[document](https://docs.google.com/document/d/18N-xwZasXLNEl2407xT1oD08Mv9Kd3p0xMX7ZJFyV20/edit?usp=sharing).
And Jack proposed two approach in the document to improve the defects of this
PR in CDC case. I would like to make some additions and summaries to this
document, and continue to discuss the approach about how to deal with the
equality deletes compaction in CDC case at here. @rdblue @openinx
@stevenzwu @kbendick

At first, I want to discuss the effectiveness of current implementation of
streaming rewrite(apply delta deletes to delta data). We can measure this
effectiveness from the type of data stream and whether the table partition
type. For the data stream type, we can be divided into append stream and update
stream(include upsert and retract/cdc stream). For the table partition type, we
can devided into partitioned and unpartitioned, and partitioned can be further
divided into partitions that contain time and partitions that do not contain
time. For effectiveness, we use `o` to indicate the streaming rewrite is
effective that is mean the streaming rewrite can do the same and good as the
`RewriteDataFiles` action and user not need to run `RewriteDataFiles` in
addition, and we use `x` to indicate the streaming rewrite is not very
effective that is mean the streaming rewrite can not do well as the
`RewriteDataFiles` action and user still need to run `RewriteDataFiles` in
addition. Finally,
we can get the matrix as follow:
| unpartitioned | partitioned without time | partitioned with time
:--: | :--: | :--: | :--:
append stream | o | o | o
upsert stream | x | x | o
retract stream | x | x | o

For append stream, streaming rewrite can work well in all partition type
because there are no delete files and we only need to bin pack data files. For
update stream(upsert and retract stream), streaming rewrite can work well with
table of partitioned with time. Because the result will become certain over
time, such as we have an day and hour partition table and T partition result
will be certain in T+1 moment, so we can rewrite T partition after T. However,
streaming rewrite can not work well with table of partitioned without time, as
Jack mentioned in the doc, the high seqNum equality delete file still in
storage and when we read the table, they will still apply to the low seqNum
data files even if this data file has been rewritten before. So, I agree with
Jack that we should first find a way to deal with the remaining equality
deletes.

And then, I want to discuss the approach about how to deal with the
remaining equality deletes. Jack list two approach in the doc. The approach A
is gathering the equality delete files and then perform a table scan to rewrite
the affected data files. And the approach B is to convert the equality deletes
to position deletes and then perform a table scan to apply the position deletes
and rewrite the affected data files. Both approach require extra table scan and
data rewrite, I think these approach is work, but will be complicated and we
need to reconsider the cost and benefit.

Finally, I think we can have approach C that we can simply rewrite the delta
equality delete files to the new one and replace them by
[`rewriteFiles`](https://github.com/apache/iceberg/blob/d85a2b25cfa453d211043cf0868dd18de9827808/api/src/main/java/org/apache/iceberg/RewriteFiles.java#L63)
. As I said above, currently impelemention of streaming rewrite can not work
well with table of partitioned without time because the high seqNum equality
delete files(the delta equality delete files) still in storage, and we know the
streaming rewrite can apply all deletes to data files. Therefore, if we can
rewrite these delta delete files to a new one and bringing them from older
seqNum to a high seqNum(the latest commit seqNum), and then remove these delete
files from latest snapthot, we can keep the number of equaliy delete fies at a
low and same as the number of data files. In this approach, we no need extra
table scan and data rewrite, and we remove unnecessary equality delete files
and av
oid apply them to the rewritten data files.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] Reo-LEI edited a comment on pull request #3323: Flink: Incrementally rewrite data files in streaming.

Reply via email to