wjones127 commented on a change in pull request #11911:
URL: https://github.com/apache/arrow/pull/11911#discussion_r769934178



##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +800,20 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
         used determined by the number of available CPU cores.
     max_partitions : int, default 1024
         Maximum number of partitions any batch may be written into.
+    max_open_files : int, default 1024
+        Maximum number of number of files can be opened
+    max_rows_per_file : int, default 0
+        Maximum number of rows per file

Review comment:
       Same here, helpful to have the extra guidance:
   
   ```suggestion
        Maximum number of rows per file. If greater than 0 then this will
        limit how many rows are placed in any single file. Otherwise there
        will be no limit and one file will be created in each output directory
        unless files need to be closed to respect max_open_files
   ```
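For anyone following along, a minimal usage sketch, assuming the `max_rows_per_file` keyword lands in `write_dataset` as proposed in this PR (the output path, format, and row counts below are arbitrary illustrations, not recommendations):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2020, 2021], "value": [1, 2, 3]})

# Cap every output file at 1,000,000 rows; row groups are kept no larger than
# the per-file cap.  With the default of 0 there is no cap and one file is
# created per output directory (unless max_open_files forces a file closed).
ds.write_dataset(
    table,
    "dataset_out",
    format="parquet",
    max_rows_per_file=1_000_000,
    max_rows_per_group=100_000,
)
```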

##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +800,20 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
         used determined by the number of available CPU cores.
     max_partitions : int, default 1024
         Maximum number of partitions any batch may be written into.
+    max_open_files : int, default 1024
+        Maximum number of number of files can be opened
+    max_rows_per_file : int, default 0
+        Maximum number of rows per file
+    min_rows_per_group : int, default 0
+        Minimum number of rows per group. When the value is greater than 0,
+        the dataset writer will batch incoming data and only write the row
+        groups to the disk when sufficient rows have accumulated.
+    max_rows_per_group : int, default 1 << 20

Review comment:
      Could we instead write the default like `1024 * 1024`? I find that easier to think about and I don't think I'm alone in that.
   
   ```suggestion
       max_rows_per_group : int, default 1024 * 1024
   ```
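Spelling the arithmetic out: `1 << 20` and `1024 * 1024` are both 1,048,576 rows. A hedged sketch of tuning the row-group bounds together, again assuming the keywords ship as named in this PR (the specific sizes are illustrative only):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"id": list(range(10_000)), "value": [0.0] * 10_000})

# Buffer incoming data until at least 64 * 1024 rows have accumulated, and
# never let a single row group exceed the 1024 * 1024 (1,048,576) row default.
ds.write_dataset(
    table,
    "grouped_out",
    format="parquet",
    min_rows_per_group=64 * 1024,
    max_rows_per_group=1024 * 1024,
)
```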

##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +800,20 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
         used determined by the number of available CPU cores.
     max_partitions : int, default 1024
         Maximum number of partitions any batch may be written into.
+    max_open_files : int, default 1024
+        Maximum number of number of files can be opened

Review comment:
      I found this a little confusing. Could we add the full docstring from the C++ docs?
   
   ```suggestion
        Maximum number of files that can be open at a time. If an attempt
        is made to open too many files then the least recently used file
        will be closed. If this setting is set too low you may end up
        fragmenting your data into many small files.
   ```
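To illustrate why the open-file cap matters, a small assumed-usage sketch: a partitioned write that fans out over many directories, with `max_open_files` (and `max_partitions`) raised so files are not repeatedly closed and reopened, which is what fragments the data into many small files. The column names, counts, and path are hypothetical:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# 5,000 distinct partition keys, so the writer wants 5,000 files open at once.
table = pa.table({
    "key": [i % 5_000 for i in range(50_000)],
    "value": list(range(50_000)),
})

# With the default max_open_files=1024 the least recently used file would be
# closed, and later rows for that partition would start a new, smaller file.
ds.write_dataset(
    table,
    "partitioned_out",
    format="parquet",
    partitioning=["key"],
    max_partitions=5_000,
    max_open_files=5_000,
)
```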



