jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845304218
##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None,
filesystem=None):
path = str(tempdir / "test_parquet_dataset")
# write_to_dataset currently requires pandas
- pq.write_to_dataset(table, path,
+ pq.write_to_dataset(table, path, use_legacy_dataset=True,
partition_cols=["part"], chunk_size=chunk_size)
Review Comment:
So here this fails when using the new dataset implementation, because
`dataset.write_dataset(..)` doesn't support the parquet `row_group_size`
keyword (to which `chunk_size` gets translated); `ParquetFileWriteOptions`
doesn't expose this option either.
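To make the failure concrete, a minimal sketch (the table, output path and
`chunk_size` value are made up for illustration, and the exact exception
raised by `make_write_options` is an assumption that may vary across pyarrow
versions):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Legacy path: chunk_size is forwarded to write_table, where it is
# translated to row_group_size for each written file.
pq.write_to_dataset(table, "legacy_out", use_legacy_dataset=True,
                    partition_cols=["part"], chunk_size=2)

# New path: ParquetFileWriteOptions has no equivalent option, so asking
# for it is expected to error out.
try:
    ds.ParquetFileFormat().make_write_options(row_group_size=2)
except Exception as exc:  # exact exception type is an assumption
    print("row_group_size not supported:", exc)
```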
On the parquet side, this is also the only keyword that is not passed to the
`ParquetWriter` init (and thus to parquet's `WriterProperties` or
`ArrowWriterProperties`), but instead to the actual `write_table` call. In C++
this can be seen at
https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71
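The same split is visible from Python: writer-level options go into the
`ParquetWriter` constructor, while `row_group_size` is given per
`write_table` call. A rough sketch with a made-up file name and data:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10))})

# Writer-level options such as compression end up in parquet's
# WriterProperties / ArrowWriterProperties via the ParquetWriter init ...
with pq.ParquetWriter("example.parquet", table.schema,
                      compression="snappy") as writer:
    # ... whereas row_group_size is a per-call argument to write_table,
    # which is why it has no natural home in ParquetFileWriteOptions.
    writer.write_table(table, row_group_size=5)
```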
cc @westonpace do you remember whether it has been discussed before how the
Parquet `row_group_size`/`chunk_size` setting should fit into the dataset API?