Simon-Shlomo Poil created FLINK-34696:
-----------------------------------------
Summary: GSRecoverableWriterCommitter is generating excessive data blobs
Key: FLINK-34696
URL: https://issues.apache.org/jira/browse/FLINK-34696
Project: Flink
Issue Type: Bug
Components: Connectors / FileSystem
Reporter: Simon-Shlomo Poil
In the "composeBlobs" method of
org.apache.flink.fs.gs.writer.GSRecoverableWriterCommitter
many small blobs are combined to generate a final single blob using the google
storage compose method. This compose action is performed iteratively each time
composing the resulting blob from the previous step with 31 new blobs until
there are not remaining blobs. When the compose action is completed the
temporary blobs are removed.
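For illustration, here is a minimal sketch of that iterative compose pattern
against the google-cloud-storage Java client. This is not the actual Flink
source: the method name composeAll, the bucket/blob-name parameters, and the
intermediate-blob naming are all illustrative assumptions. The 32/31 split
comes from GCS compose accepting at most 32 source objects per request.
{code:java}
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;

import java.util.ArrayList;
import java.util.List;

public class ComposeSketch {

    // GCS compose accepts at most 32 source objects per request.
    private static final int MAX_SOURCES = 32;

    /** Iteratively composes all blobNames into finalName; deletes nothing. */
    static String composeAll(
            Storage storage, String bucket, List<String> blobNames, String finalName) {
        List<String> remaining = new ArrayList<>(blobNames);
        String current = null; // intermediate result of the previous round
        int round = 0;
        while (!remaining.isEmpty()) {
            Storage.ComposeRequest.Builder builder = Storage.ComposeRequest.newBuilder();
            int budget = MAX_SOURCES;
            if (current != null) {
                builder.addSource(current); // re-copies everything composed so far
                budget--;
            }
            int take = Math.min(budget, remaining.size());
            builder.addSource(remaining.subList(0, take));
            remaining = remaining.subList(take, remaining.size());
            // Hypothetical intermediate naming; the real committer has its own scheme.
            String target =
                    remaining.isEmpty() ? finalName : finalName + ".intermediate-" + round++;
            Blob composed = storage.compose(
                    builder.setTarget(BlobInfo.newBuilder(bucket, target).build()).build());
            current = composed.getName();
            // No deletion here: the temporary blobs are removed only after the
            // whole sequence finishes, so every intermediate blob coexists with
            // the originals until the very end.
        }
        return current;
    }
}
{code}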
Because every intermediate result re-copies all of the data composed so far,
this leads to significant excess data storage, which on Google Storage is
rather costly.
*Simple example*
We have 64 blobs of 1 GB each, i.e. 64 GB in total.
1st step: 32 blobs are composed into one 32 GB blob; storage is now 64 GB + 32
GB = 96 GB.
2nd step: the 32 GB blob from the previous step is composed with 31 more blobs
into a 63 GB blob; storage is now 64 GB + 32 GB + 63 GB = 159 GB.
3rd step: the last remaining blob is composed with the blob from the previous
step; storage is now 64 GB + 32 GB + 63 GB + 64 GB = 223 GB.
I.e., to combine 64 GB of data we incurred an overhead of 159 GB.
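The growth generalizes: every round after the first adds a new intermediate
blob holding everything composed so far plus up to 31 more source blobs. A
small, purely illustrative helper (peakStorage is a hypothetical name, not
part of any API) that reproduces the arithmetic above:
{code:java}
/** Returns peak bytes held for n equal-sized source blobs under the scheme above. */
static long peakStorage(long n, long blobSize) {
    long peak = n * blobSize;                       // the original temporary blobs
    long intermediate = Math.min(32, n) * blobSize; // first round: up to 32 sources
    long remaining = n - Math.min(32, n);
    peak += intermediate;
    while (remaining > 0) {
        long take = Math.min(31, remaining); // later rounds: previous result + 31 sources
        intermediate += take * blobSize;     // each result re-copies all data so far
        peak += intermediate;                // intermediates pile up until the end
        remaining -= take;
    }
    return peak; // peakStorage(64, 1) == 223, matching the example above
}
{code}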
*Why is this a big issue?*
With large amounts of data the overhead becomes significant. With TiB of input
data we experienced storage peaks in the PiB range, leading to unexpectedly
high costs and to (perhaps unrelated) frequent storage exceptions thrown by
the Google Storage library.
*Suggested solution:*
Delete the source blobs as soon as each compose step completes, rather than
only after the whole sequence, so that the data is never held in more than one
extra copy at a time. This may have implications for recoverability that would
need to be checked. A sketch of one such round follows.
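A sketch of the suggested fix, again against the google-cloud-storage client;
composeAndDelete is a hypothetical name, failure handling is omitted, and this
is not a patch against the actual committer. With eager deletion, the 64 GB
example above would peak at 128 GB (the data plus one intermediate copy in the
worst round) instead of 223 GB.
{code:java}
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;

import java.util.List;

/**
 * One compose round with eager cleanup: the source blobs are deleted as soon
 * as the composed result exists, so at most one extra copy of the data is
 * ever held. Illustrative only; recovery after a failure mid-sequence would
 * have to tolerate sources that are already gone.
 */
static Blob composeAndDelete(
        Storage storage, String bucket, List<String> sources, String targetName) {
    Blob composed = storage.compose(
            Storage.ComposeRequest.newBuilder()
                    .setTarget(BlobInfo.newBuilder(bucket, targetName).build())
                    .addSource(sources)
                    .build());
    for (String source : sources) {
        storage.delete(BlobId.of(bucket, source)); // reclaim space immediately
    }
    return composed;
}
{code}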