kbendick commented on issue #4875: URL: https://github.com/apache/iceberg/issues/4875#issuecomment-1138828561
Hi @Cqz666! It's hard for me to comment on how long your Spark job should take, since I'm not sure how many resources you have, etc. It does look like you have _many_ small files that are getting compacted. Here are a few suggestions for making your compactions faster (as well as speeding up your ingest somewhat).

First off, you can see many relevant configuration keys by looking at this code: https://github.com/apache/iceberg/blob/128d5a161fda076118a7cab1d95ab5064400e08a/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java#L79-L87. I would suggest downloading the project, importing it into IntelliJ, and jumping around from there to look at the relevant configs, as well as the unit tests.

As for things you can do to speed up the process:

1. Use Avro for the Flink ingestion part, then rewrite the files as Parquet.
   - This might be something you don't want to tackle right away (it's a large change to your pipeline), but Parquet files are somewhat inefficient at such a small size: each one needs things like a footer and row groups, which add overhead relative to the small amount of data per file. With so little data per file, Avro is likely the better choice for ingest.
   - This is completely allowed, as a table can contain a mixture of file formats. If Flink is the primary writer to this table, I would suggest setting the table's `write.format.default` to `avro` and then overriding that in the write `options` when using Spark (see the first sketch below). Please try this on a non-production table first.
2. Break the job into several commit groups.
   - By default, a compaction job writes all of its newly written files in _one_ commit, with file groups of up to 100GB. You might consider setting the option `partial-progress.enabled` to `true` to allow work to be committed in more batches (see the second sketch below). This causes the data to be committed in up to 10 groups by default, which reduces the amount of work that has to be redone if the underlying table changes while the compaction runs. There's ongoing work on smarter detection of change sets, so that less work needs to be replayed when another writer has committed to the table.
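For the first suggestion, here's a minimal sketch, assuming a loaded `org.apache.iceberg.Table` named `table` and a Spark DataFrame `df` already exist; the table identifier and values are only illustrative, not from your setup. It makes Avro the table-wide default (which the Flink writer picks up) and uses the `write-format` Spark write option to keep Spark writes on Parquet:

```java
// Illustrative sketch; `table` and `df` are assumed to already exist.

// Make Avro the table-wide default write format, so the Flink streaming
// writer starts producing Avro data files:
table.updateProperties()
    .set("write.format.default", "avro")
    .commit();

// A Spark write can still produce Parquet by overriding the table default
// with the "write-format" write option:
df.write()
    .format("iceberg")
    .option("write-format", "parquet")   // per-write override of write.format.default
    .mode("append")
    .save("db.events");                  // hypothetical table identifier
```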

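And for the second suggestion, a minimal sketch of running the compaction from Spark with partial progress enabled; the option keys are the ones listed in the `RewriteDataFiles` / `BaseRewriteDataFilesSparkAction` code linked above, and the values here are only illustrative:

```java
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;

// Sketch only; assumes `spark` (SparkSession) and `table` (org.apache.iceberg.Table)
// are already available.
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .binPack()                                         // default strategy, shown for clarity
        .option("partial-progress.enabled", "true")        // commit file groups as they finish
        .option("partial-progress.max-commits", "10")      // up to 10 commits (the default)
        .option("max-concurrent-file-group-rewrites", "5") // rewrite several groups in parallel
        .execute();

System.out.println("Rewrote " + result.rewrittenDataFilesCount() + " data files");
```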