Cqz666 opened a new issue, #4875:
URL: https://github.com/apache/iceberg/issues/4875
Hi team,
I use Flink to write data from Kafka to Iceberg. The tables are partitioned by date and event, e.g. datekey='20220526' and event='xxxx'.
The event partition alone produces more than 300 partitions per day.
Because streaming writes produce a lot of small files (roughly one Parquet file per partition almost every minute), I tried compacting the data with the official Spark action, as shown below:
```
SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("datekey", dateKey))
    .filter(Expressions.equal("event", event))
    .option("target-file-size-bytes", Long.toString(128 * 1024 * 1024))
    .execute();
```

Unfortunately, it is too slow: compacting one day's worth of data takes almost half a day.
When I try running the merges for different partitions in parallel threads, the metadata commits conflict and the job fails with:
"Cannot commit: stale table metadata".
My questions are:
1. How can I shorten the compaction time for a multi-partition table?
2. How should the three actions rewriteDataFiles, expireSnapshots, and deleteOrphanFiles be coordinated? I don't know when to run them or in what order (see the sketch after this list for what I currently have in mind).
3. Will Iceberg support automatic compaction of small files in the future? If so, it would save a lot of extra work.
Could someone help clarify these points?
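
For question 2, this is roughly the nightly order I had in mind: compact first, then expire snapshots, then clean up orphan files. The 7-day window and retainLast(10) are placeholder values, not something I found in the docs:
```
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.spark.actions.SparkActions;

// 1. Compact small files first (the rewriteDataFiles call above).

// 2. Then expire snapshots older than the retention window, so the
//    replaced small files are no longer referenced by any snapshot.
long weekAgoMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
SparkActions
    .get()
    .expireSnapshots(table)
    .expireOlderThan(weekAgoMillis)
    .retainLast(10)
    .execute();

// 3. Finally remove orphan files left behind by failed or aborted writes.
SparkActions
    .get()
    .deleteOrphanFiles(table)
    .olderThan(weekAgoMillis)
    .execute();
```
Is this ordering correct, and how often should each step run?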
P.S. My Iceberg version is 0.13.1.
Best regards,
Cqz