xabriel opened a new pull request #432: Allow writers to control size of files generated
URL: https://github.com/apache/incubator-iceberg/pull/432
 
 
 For big jobs where the generated Parquet files reach 10 GB or more, we have found read latency related to reading the Parquet footer.
   
 For our data and tech stack, we observe that reading the footer takes about 1 second per 10 GB of file size:
   
   ```
   Time to read footer of ~302 MB parquet file:
   ms: 273
   ms: 214
   ms: 289
   ms: 262
   
   Time to read footer of ~17GB parquet file:
   ms: 1907
   ms: 1925
   ms: 1933
   ms: 1855
   
   Time to read footer of ~67GB parquet file:
   ms: 6073
   ms: 5587
   ms: 5293
   ms: 5691
   ```
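   
   These timings can be gathered with a simple harness; below is a minimal sketch of the approach (the footer read itself is replaced by a placeholder sleep, and `FooterTiming`/`timeMs` are illustrative names, not code from this PR — with parquet-hadoop the timed action would be the call that opens the file and parses its footer):
   
   ```java
   public class FooterTiming {
       // Time a single invocation of an action, in milliseconds.
       static long timeMs(Runnable action) {
           long start = System.nanoTime();
           action.run();
           return (System.nanoTime() - start) / 1_000_000;
       }
   
       public static void main(String[] args) {
           // Placeholder for the real footer read; a sleep stands in for the I/O.
           for (int i = 0; i < 4; i++) {
               long ms = timeMs(() -> {
                   try {
                       Thread.sleep(10);
                   } catch (InterruptedException e) {
                       Thread.currentThread().interrupt();
                   }
               });
               System.out.println("ms: " + ms);
           }
       }
   }
   ```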
   
 To avoid this, we propose this PR, which allows Iceberg writers to close the current file and open a new one once a target file size is reached. The semantics of having at most one file open per writer are unchanged, and for a `PartitionedWriter`, the semantics of failing when the data is not ordered are kept as well.
   
 With this PR, we can now do:
   ```scala
   df
     .sort(...)
     .write
     .format("iceberg")
     .option("target-file-size", 1 * 1024 * 1024 * 1024) // target 1GB files
     .mode("append")
     .save("...")
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
