Cqz666 opened a new issue, #4875:
URL: https://github.com/apache/iceberg/issues/4875
Hi team,
I use Flink to write data from Kafka to Iceberg. The tables are partitioned by date and event, e.g. datekey='20220526' and event='xxxx'.
The event partition alone produces more than 300 partitions per day.
Because streaming writes produce a lot of small files (roughly one Parquet file per partition almost every minute), I tried compacting the data with the official Spark action, as shown below:
```
SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("datekey", dateKey))
    .filter(Expressions.equal("event", event))
    .option("target-file-size-bytes", Long.toString(128 * 1024 * 1024))
    .execute();
```

Unfortunately, it is too slow: compacting one day's worth of data takes almost half a day.
When I try running the merges for different partitions in parallel threads, the metadata commits conflict and the job fails with:
"Cannot commit: stale table metadata".
My questions are:
1. How can I shorten the compaction time for a multi-partition table?
2. How should the three actions rewriteDataFiles, expireSnapshots, and deleteOrphanFiles be coordinated? I don't know when to run them or in what order (see the sketch after this list for what I currently have in mind).
3. Will Iceberg support automatic compaction of small files in the future? If so, it would save a lot of extra work.
Could someone help clarify these points?
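
For question 2, this is roughly the nightly order I had in mind: compact first, then expire snapshots, then clean up orphan files. The 7-day window and retainLast(10) are placeholder values, not something I found in the docs:
```
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.spark.actions.SparkActions;

// 1. Compact small files first (the rewriteDataFiles call above).

// 2. Then expire snapshots older than the retention window, so the
//    replaced small files are no longer referenced by any snapshot.
long weekAgoMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
SparkActions
    .get()
    .expireSnapshots(table)
    .expireOlderThan(weekAgoMillis)
    .retainLast(10)
    .execute();

// 3. Finally remove orphan files left behind by failed or aborted writes.
SparkActions
    .get()
    .deleteOrphanFiles(table)
    .olderThan(weekAgoMillis)
    .execute();
```
Is this ordering correct, and how often should each step run?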
P.S. My Iceberg version is 0.13.1.
Best regards,
Cqz