xabriel opened a new pull request #432: Allow writers to control size of files generated
URL: https://github.com/apache/incubator-iceberg/pull/432

For big jobs where the generated Parquet files grow to 10GB or more, we have observed read latency attributable to reading the Parquet footer. For our data and tech stack, footer read time is roughly 1 second per 10GB of file size:

```
Time to read footer of ~302 MB parquet file:
ms: 273
ms: 214
ms: 289
ms: 262

Time to read footer of ~17GB parquet file:
ms: 1907
ms: 1925
ms: 1933
ms: 1855

Time to read footer of ~67GB parquet file:
ms: 6073
ms: 5587
ms: 5293
ms: 5691
```

To avoid this, this PR allows Iceberg writers to close the current file and open a new one once a target file size is reached. The semantics of having at most one file open per writer are unchanged, and in the case of a `PartitionedWriter`, the semantics of failing when the data is not ordered are preserved as well.

With this PR, we can now do:

```
df
  .sort(...)
  .write
  .format("iceberg")
  .option("target-file-size", 1 * 1024 * 1024 * 1024) // target 1GB files
  .mode("append")
  .save("...")
```
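The rolling behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not Iceberg's actual writer classes: `RollingWriter`, `write`, and `closedFileSizes` are hypothetical names, and record sizes are simplified to a byte count per record. The key invariants from the PR are that at most one file is open at a time and that a file is closed once the target size is reached.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a size-bounded writer: accumulate bytes into the
// current file, and roll to a new file once the target size would be exceeded.
class RollingWriter {
    private final long targetFileSize;              // e.g. 1GB as in the PR example
    private long currentFileBytes = 0;              // bytes in the single open file
    private final List<Long> closedFileSizes = new ArrayList<>();

    RollingWriter(long targetFileSize) {
        this.targetFileSize = targetFileSize;
    }

    // "Write" a record of the given encoded size. If appending it would push
    // the current file past the target, close that file and start a new one,
    // so at most one file is ever open.
    void write(long recordBytes) {
        if (currentFileBytes > 0 && currentFileBytes + recordBytes > targetFileSize) {
            roll();
        }
        currentFileBytes += recordBytes;
    }

    // Close the writer, flushing the last partially filled file if any.
    void close() {
        if (currentFileBytes > 0) {
            roll();
        }
    }

    private void roll() {
        closedFileSizes.add(currentFileBytes);
        currentFileBytes = 0;
    }

    List<Long> closedFileSizes() {
        return closedFileSizes;
    }
}
```

With a target of 100 bytes, writing three 60-byte records produces three closed files of 60 bytes each, since no two records fit under the target together.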
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
