jorisvandenbossche commented on issue #34264:
URL: https://github.com/apache/arrow/issues/34264#issuecomment-1438345998
Yes, the `ParquetWriter` interface is the low-level interface for writing
_single_ files (so with it you need to handle this logic manually), but the
generic dataset writing functionality lets you control file size in _some_
way and thus automatically split your dataset into multiple files. However,
this is based on the number of rows written, not the resulting file size.
You could still use it if you can make a rough estimate of how many rows
correspond to a given size.
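
For completeness, a minimal sketch of what that manual handling could look
like with the low-level `ParquetWriter`, assuming a fixed 3000-row chunk per
file (the chunk size and the `manual-part-*.parquet` names are just
illustrative, not anything built into pyarrow):
```
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": range(10000)})
rows_per_file = 3000  # assumed rough estimate matching the desired file size

# Slice the table into fixed-size row chunks and write each one to its own
# file; slice() clamps at the end, so the last chunk may simply be shorter.
for part, offset in enumerate(range(0, table.num_rows, rows_per_file)):
    chunk = table.slice(offset, rows_per_file)
    with pq.ParquetWriter(f"manual-part-{part}.parquet", table.schema) as writer:
        writer.write_table(chunk)
```
The dataset writer essentially does this bookkeeping for you.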
With the dataset API, this looks like:
```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({"col": range(10000)})
>>> ds.write_dataset(table, "test_split", format="parquet",
...                  max_rows_per_file=3000, max_rows_per_group=3000)
```
```
$ ls test_split/
part-0.parquet part-1.parquet part-2.parquet part-3.parquet
```
(I needed to specify `max_rows_per_group` as well, but that's only because I
used a tiny example and that keyword's default is larger than 3000.)
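
As a sketch of the "rough estimate of rows for a given size" idea: you could
derive `max_rows_per_file` from the table's in-memory bytes per row. The
64 MiB budget and the `test_split_sized` directory below are assumptions of
mine, and the files on disk will typically end up smaller than the budget
because of Parquet encoding and compression:
```
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"col": range(10000)})
target_bytes = 64 * 1024 * 1024  # hypothetical per-file budget (in-memory size)
bytes_per_row = table.nbytes / table.num_rows
rows_per_file = max(1, int(target_bytes / bytes_per_row))

ds.write_dataset(
    table, "test_split_sized", format="parquet",
    max_rows_per_file=rows_per_file,
    # a row group cannot be larger than the per-file cap
    max_rows_per_group=min(rows_per_file, 1024 * 1024),
)
```
For a tighter estimate, you could write a sample chunk to disk first and
measure its actual compressed size instead of relying on `Table.nbytes`.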