jorisvandenbossche commented on issue #34264:
URL: https://github.com/apache/arrow/issues/34264#issuecomment-1438345998
Yes, the `ParquetWriter` interface is the low-level interface for writing
_single_ files (so with it you need to handle this logic manually), but the
generic dataset writing functionality lets you control file size in _some_
way and thus automatically split your dataset into multiple files. However,
this is based on the number of rows written, not the resulting file size.
You could still use it if you can make a rough estimate of how many rows
correspond to a given size.
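
For completeness, a minimal sketch of what that manual handling could look
like with the low-level `ParquetWriter`, assuming a fixed 3000-row chunk per
file (the chunk size and the `manual-part-*.parquet` names are just
illustrative, not anything built into pyarrow):
```
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": range(10000)})
rows_per_file = 3000  # assumed rough estimate matching the desired file size

# Slice the table into fixed-size row chunks and write each one to its own
# file; slice() clamps at the end, so the last chunk may simply be shorter.
for part, offset in enumerate(range(0, table.num_rows, rows_per_file)):
    chunk = table.slice(offset, rows_per_file)
    with pq.ParquetWriter(f"manual-part-{part}.parquet", table.schema) as writer:
        writer.write_table(chunk)
```
The dataset writer essentially does this bookkeeping for you.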
With the dataset API, this looks like:
```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({"col": range(10000)})
>>> ds.write_dataset(table, "test_split", format="parquet",
...                  max_rows_per_file=3000, max_rows_per_group=3000)
```
```
$ ls test_split/
part-0.parquet part-1.parquet part-2.parquet part-3.parquet
```
(I needed to specify `max_rows_per_group` as well, but that's only because I
used a tiny example and that keyword's default is larger than 3000.)
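
As a sketch of the "rough estimate of rows for a given size" idea: you could
derive `max_rows_per_file` from the table's in-memory bytes per row. The
64 MiB budget and the `test_split_sized` directory below are assumptions of
mine, and the files on disk will typically end up smaller than the budget
because of Parquet encoding and compression:
```
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"col": range(10000)})
target_bytes = 64 * 1024 * 1024  # hypothetical per-file budget (in-memory size)
bytes_per_row = table.nbytes / table.num_rows
rows_per_file = max(1, int(target_bytes / bytes_per_row))

ds.write_dataset(
    table, "test_split_sized", format="parquet",
    max_rows_per_file=rows_per_file,
    # a row group cannot be larger than the per-file cap
    max_rows_per_group=min(rows_per_file, 1024 * 1024),
)
```
For a tighter estimate, you could write a sample chunk to disk first and
measure its actual compressed size instead of relying on `Table.nbytes`.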