jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845304218
##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None,
filesystem=None):
path = str(tempdir / "test_parquet_dataset")
# write_to_dataset currently requires pandas
- pq.write_to_dataset(table, path,
+ pq.write_to_dataset(table, path, use_legacy_dataset=True,
partition_cols=["part"], chunk_size=chunk_size)
Review Comment:
So here this fails when using the new dataset implementation, because
`dataset.write_dataset(..)` doesn't support the parquet `row_group_size`
keyword (to which `chunk_size` gets translated); `ParquetFileWriteOptions`
doesn't expose this option either.
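To make the failure concrete, a minimal sketch (the table, output path and
`chunk_size` value are made up for illustration, and the exact exception
raised by `make_write_options` is an assumption that may vary across pyarrow
versions):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Legacy path: chunk_size is forwarded to write_table, where it is
# translated to row_group_size for each written file.
pq.write_to_dataset(table, "legacy_out", use_legacy_dataset=True,
                    partition_cols=["part"], chunk_size=2)

# New path: ParquetFileWriteOptions has no equivalent option, so asking
# for it is expected to error out.
try:
    ds.ParquetFileFormat().make_write_options(row_group_size=2)
except Exception as exc:  # exact exception type is an assumption
    print("row_group_size not supported:", exc)
```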
On the parquet side, this is also the only keyword that is not passed to the
`ParquetWriter` init (and thus to parquet's `WriterProperties` or
`ArrowWriterProperties`), but instead to the actual `write_table` call. In C++
this can be seen at
https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71
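The same split is visible from Python: writer-level options go into the
`ParquetWriter` constructor, while `row_group_size` is given per
`write_table` call. A rough sketch with a made-up file name and data:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10))})

# Writer-level options such as compression end up in parquet's
# WriterProperties / ArrowWriterProperties via the ParquetWriter init ...
with pq.ParquetWriter("example.parquet", table.schema,
                      compression="snappy") as writer:
    # ... whereas row_group_size is a per-call argument to write_table,
    # which is why it has no natural home in ParquetFileWriteOptions.
    writer.write_table(table, row_group_size=5)
```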
cc @westonpace do you remember whether it has been discussed before how the
Parquet `row_group_size`/`chunk_size` setting should fit into the dataset API?