[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10628: ARROW-12364: [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

GitBox Wed, 07 Jul 2021 11:06:36 -0700


jorisvandenbossche commented on a change in pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#discussion_r665598566




##########
File path: python/pyarrow/dataset.py
##########
@@ -731,6 +731,12 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
         (e.g. S3)
     max_partitions : int, default 1024
         Maximum number of partitions any batch may be written into.
+    file_visitor : Function
+        If set, this function will be called with a WrittenFile instance
+        for each file created during the call.  This object will contain
+        the path and (if the dataset is a parquet dataset) the parquet

Review comment:
   The `WrittenFile` class is (currently) not exposed in the `pyarrow.dataset` namespace (and I think it is good to keep it that way, so that users don't come to rely on the specific class), so I think we still need to be more explicit about what the object provides: e.g. "contain the path" -> "have a `path` attribute".
   
   A small example might also help to illustrate, e.g. this one from the tests:
   
   ```
       visited_paths = []
   
       def file_visitor(written_file):
           visited_paths.append(written_file.path)
   ```
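
   To illustrate why documenting the `path` attribute (rather than the class) is enough, here is a minimal sketch. It assumes only what the docstring above states, namely that the visitor receives an object exposing `path`. The `SimpleNamespace` objects and the file names are hypothetical stand-ins for whatever the writer would actually pass in, not the real `WrittenFile` class.

```python
from types import SimpleNamespace

visited_paths = []


def file_visitor(written_file):
    # Rely only on the documented `path` attribute, not on the concrete class.
    visited_paths.append(written_file.path)


# Hypothetical stand-ins for the objects the dataset writer would pass in;
# the file names are made up for illustration.
for fake_path in ("part-0.parquet", "part-1.parquet"):
    file_visitor(SimpleNamespace(path=fake_path))

# With the option proposed in this PR, the same function would presumably be
# passed as: ds.write_dataset(data, base_dir, ..., file_visitor=file_visitor)
print(visited_paths)
```

   Since the visitor only touches `written_file.path`, it keeps working even if the underlying class is renamed or replaced later.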




