jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r853271087


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2964,43 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    use_threads : bool, default True
+        Write files in parallel. If enabled, the maximum level of parallelism
+        determined by the number of available CPU cores will be used.
+    schema : Schema, optional
+    partitioning : Partitioning or list[str], optional
+        The partitioning scheme specified with the ``partitioning()``

Review Comment:
   ```suggestion
        The partitioning scheme specified with the ``pyarrow.dataset.partitioning()``
   ```
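
For context, a minimal sketch of how the new `partitioning` keyword could be used, assuming it is wired through to `pyarrow.dataset` as the docstring describes (the table, column names, and output path below are purely illustrative, not from the PR):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Illustrative table with a column to partition on.
table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# Hive-style partitioning built with pyarrow.dataset.partitioning(),
# the function the suggestion above cross-references.
part = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")

# Writes year=2020/ and year=2021/ directories under the root path.
pq.write_to_dataset(table, "/tmp/example_dataset", partitioning=part)
```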



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -3011,7 +3012,8 @@ def _create_parquet_dataset_simple(root_path):
     for i in range(4):
         table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
         pq.write_to_dataset(
-            table, str(root_path), metadata_collector=metadata_collector
+            table, str(root_path), use_legacy_dataset=True,
+            metadata_collector=metadata_collector

Review Comment:
   Was there a reason for specifying True here specifically? (`metadata_collector` should be supported with `use_legacy_dataset=False` as well)
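
To illustrate the point, a rough sketch of the same helper without forcing the legacy path, assuming `metadata_collector` behaves the same under the new implementation (the output path is a placeholder):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

root_path = "/tmp/parquet_dataset_simple"  # placeholder path
metadata_collector = []

for i in range(4):
    table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
    # Same call as in the test helper, but relying on the default
    # (non-legacy) code path to fill metadata_collector.
    pq.write_to_dataset(
        table, root_path, metadata_collector=metadata_collector)

# The collected FileMetaData objects can then back a _metadata file.
pq.write_metadata(table.schema, root_path + "/_metadata",
                  metadata_collector=metadata_collector)
```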



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None, filesystem=None):
     path = str(tempdir / "test_parquet_dataset")
 
     # write_to_dataset currently requires pandas
-    pq.write_to_dataset(table, path,
+    pq.write_to_dataset(table, path, use_legacy_dataset=True,
                         partition_cols=["part"], chunk_size=chunk_size)

Review Comment:
   Maybe we can open a follow-up JIRA for this one?
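
If such a follow-up is filed, one possible direction is to have the helper write through `pyarrow.dataset` directly, which sidesteps the pandas requirement mentioned in the test comment; a hedged sketch under that assumption (the table contents are invented for illustration, only the "part" column and path mirror the surrounding test):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Invented data; only the "part" column mirrors the test above.
table = pa.table({"f1": list(range(20)),
                  "part": ["a"] * 10 + ["b"] * 10})

# Equivalent of partition_cols=["part"]: hive-style part=a/ and part=b/ dirs.
ds.write_dataset(
    table, "/tmp/test_parquet_dataset", format="parquet",
    partitioning=ds.partitioning(pa.schema([("part", pa.string())]),
                                 flavor="hive"))
```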


