kbendick commented on issue #4875:
URL: https://github.com/apache/iceberg/issues/4875#issuecomment-1138828561

   Hi @Cqz666! It's hard for me to comment on how long your Spark job should take, as I don't know how many resources you have, etc. That said, it does look like you have _many_ small files that are getting compacted.
   
   Here are a few things I would suggest to make your compactions faster (and speed up your ingest somewhat):
   
   First off, you can see many relevant configuration keys by looking at the code here: 
https://github.com/apache/iceberg/blob/128d5a161fda076118a7cab1d95ab5064400e08a/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java#L79-L87 . I would suggest downloading the project, importing it into IntelliJ, and jumping around from there to look at the relevant configs, as well as the unit tests.
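
   For example, those options can be passed straight into the rewrite action via `option(...)`. Below is a minimal sketch in Scala, assuming a `spark-shell` session with the Iceberg Spark runtime on the classpath; the table name `db.events` and the option values are placeholders to adapt to your setup.

```scala
import org.apache.iceberg.spark.Spark3Util
import org.apache.iceberg.spark.actions.SparkActions

// Load the Iceberg table behind the Spark table name ("db.events" is a placeholder).
val table = Spark3Util.loadIcebergTable(spark, "db.events")

// Bin-pack compaction, passing options from the config keys linked above.
val result = SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("target-file-size-bytes", (512L * 1024 * 1024).toString) // ~512 MB output files
  .option("max-concurrent-file-group-rewrites", "4")               // rewrite several file groups in parallel
  .execute()

println(s"Rewrote ${result.rewrittenDataFilesCount()} files into ${result.addedDataFilesCount()} files")
```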
   
   
   As for things you can do to speed up the process:
   
   1. Use Avro for the Flink ingestion part, then rewrite the files as Parquet.
       - This might be something you don't want to tackle right away (it's a large change to your pipeline), but Parquet files are somewhat inefficient at such small sizes, since each one needs a footer and row groups, which adds overhead per file. With so little data per file, Avro may be a better fit.
       - This is completely allowed, as tables can have a mixture of file formats. If Flink is the primary writer to this table, I would suggest setting the table's `write.format.default` to `avro` and then overriding it in the write `options` when using Spark (see the sketch after this list). Please try this on a non-production table first.
   2. Break the job into several commit groups.
       - By default, a compaction job commits all of its newly written files in _one_ commit, rewriting file groups of up to 100 GB each. You might consider setting `partial-progress.enabled` to `true` to allow the work to be committed in more batches: the data is then committed in up to 10 groups by default (`partial-progress.max-commits`), which reduces the amount of work that has to be redone if the underlying table changes during the rewrite (see the sketch after this list). There's ongoing work on smarter detection of change sets, so that less work needs to be replayed if another writer has committed to the table.
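
   Putting the two suggestions above together, here is a minimal sketch in Scala (again assuming `spark-shell` with the Iceberg Spark runtime on the classpath; `db.events` and `staging.events` are placeholder table names, and the option values shown are just the defaults mentioned above):

```scala
import org.apache.iceberg.spark.Spark3Util
import org.apache.iceberg.spark.actions.SparkActions

// 1. Make Avro the table's default write format, so the Flink ingest writes Avro files.
spark.sql("ALTER TABLE db.events SET TBLPROPERTIES ('write.format.default' = 'avro')")

// Spark batch writes that should still produce Parquet can override the table
// default per write with the `write-format` write option.
val df = spark.table("staging.events") // placeholder source data
df.writeTo("db.events").option("write-format", "parquet").append()

// 2. Compact the small files with partial progress enabled, so the rewrite commits
//    in up to `partial-progress.max-commits` groups (10 by default) rather than
//    a single commit at the end.
val table = Spark3Util.loadIcebergTable(spark, "db.events")
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("partial-progress.enabled", "true")
  .option("partial-progress.max-commits", "10")
  .execute()
```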

