kbendick commented on issue #4875: URL: https://github.com/apache/iceberg/issues/4875#issuecomment-1138828561
Hi @Cqz666! It's hard for me to comment on how long your Spark job should take, since I'm not sure how many resources you have, etc. It does look like you have _many_ small files that are getting compacted. Here are a few suggestions for making your compactions faster (as well as speeding up your ingest somewhat).

First off, you can see many relevant configuration keys by looking at this code: https://github.com/apache/iceberg/blob/128d5a161fda076118a7cab1d95ab5064400e08a/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java#L79-L87. I would suggest downloading the project, importing it into IntelliJ, and jumping around from there to look at the relevant configs, as well as the unit tests.

As for things you can do to speed up the process:

1. Use Avro for the Flink ingestion part, then rewrite the files as Parquet.
   - This might be something you don't want to tackle right away (it's a large change to your pipeline), but Parquet files are somewhat inefficient at such a small size: each one needs things like a footer and row groups, which add overhead relative to the small amount of data per file. With so little data per file, Avro is likely the better choice for ingest.
   - This is completely allowed, as a table can contain a mixture of file formats. If Flink is the primary writer to this table, I would suggest setting the table's `write.format.default` to `avro` and then overriding that in the write `options` when using Spark (see the first sketch below). Please try this on a non-production table first.
2. Break the job into several commit groups.
   - By default, a compaction job writes all of its newly written files in _one_ commit, with file groups of up to 100GB. You might consider setting the option `partial-progress.enabled` to `true` to allow work to be committed in more batches (see the second sketch below). This causes the data to be committed in up to 10 groups by default, which reduces the amount of work that has to be redone if the underlying table changes while the compaction runs. There's ongoing work on smarter detection of change sets, so that less work needs to be replayed when another writer has committed to the table.
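For the first suggestion, here's a minimal sketch, assuming a loaded `org.apache.iceberg.Table` named `table` and a Spark DataFrame `df` already exist; the table identifier and values are only illustrative, not from your setup. It makes Avro the table-wide default (which the Flink writer picks up) and uses the `write-format` Spark write option to keep Spark writes on Parquet:

```java
// Illustrative sketch; `table` and `df` are assumed to already exist.

// Make Avro the table-wide default write format, so the Flink streaming
// writer starts producing Avro data files:
table.updateProperties()
    .set("write.format.default", "avro")
    .commit();

// A Spark write can still produce Parquet by overriding the table default
// with the "write-format" write option:
df.write()
    .format("iceberg")
    .option("write-format", "parquet")   // per-write override of write.format.default
    .mode("append")
    .save("db.events");                  // hypothetical table identifier
```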

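And for the second suggestion, a minimal sketch of running the compaction from Spark with partial progress enabled; the option keys are the ones listed in the `RewriteDataFiles` / `BaseRewriteDataFilesSparkAction` code linked above, and the values here are only illustrative:

```java
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;

// Sketch only; assumes `spark` (SparkSession) and `table` (org.apache.iceberg.Table)
// are already available.
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .binPack()                                         // default strategy, shown for clarity
        .option("partial-progress.enabled", "true")        // commit file groups as they finish
        .option("partial-progress.max-commits", "10")      // up to 10 commits (the default)
        .option("max-concurrent-file-group-rewrites", "5") // rewrite several groups in parallel
        .execute();

System.out.println("Rewrote " + result.rewrittenDataFilesCount() + " data files");
```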