wjones127 commented on a change in pull request #11911:
URL: https://github.com/apache/arrow/pull/11911#discussion_r769934178
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +800,20 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
used determined by the number of available CPU cores.
max_partitions : int, default 1024
Maximum number of partitions any batch may be written into.
+ max_open_files : int, default 1024
+ Maximum number of number of files can be opened
+ max_rows_per_file : int, default 0
+ Maximum number of rows per file
Review comment:
Same here, helpful to have the extra guidance:
```suggestion
Maximum number of rows per file. If greater than 0 then this will
limit how many rows are placed in any single file. Otherwise there
will be no limit and one file will be created in each output directory
unless files need to be closed to respect max_open_files.
```
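For context, a minimal usage sketch of how this parameter would behave once the PR lands (the path and row counts are placeholders, and I'm assuming `max_rows_per_group` has to stay at or below `max_rows_per_file`):
```python
# Hypothetical sketch of max_rows_per_file, assuming the parameter
# lands as named in this PR. Paths and sizes are placeholders.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": range(10_000)})

ds.write_dataset(
    table,
    "/tmp/max_rows_demo",
    format="parquet",
    max_rows_per_file=2_500,   # split the 10,000 rows across ~4 files
    max_rows_per_group=2_500,  # row groups cannot exceed the file cap
)
```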
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +800,20 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
used determined by the number of available CPU cores.
max_partitions : int, default 1024
Maximum number of partitions any batch may be written into.
+ max_open_files : int, default 1024
+ Maximum number of number of files can be opened
+ max_rows_per_file : int, default 0
+ Maximum number of rows per file
+ min_rows_per_group : int, default 0
+ Minimum number of rows per group. When the value is greater than 0,
+ the dataset writer will batch incoming data and only write the row
+ groups to the disk when sufficient rows have accumulated.
+ max_rows_per_group : int, default 1 << 20
Review comment:
Could we instead write the default as `1024 * 1024`? I find that
easier to think about, and I don't think I'm alone in that.
```suggestion
max_rows_per_group : int, default 1024 * 1024
```
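A small sketch of how the two row-group knobs interact, with the default spelled the readable way (path is a placeholder; assumes the parameters land as named in this PR):
```python
# Hypothetical sketch of the row-group parameters, assuming they land
# as named in this PR. The output path is a placeholder.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": range(3_000_000)})

ds.write_dataset(
    table,
    "/tmp/row_group_demo",
    format="parquet",
    min_rows_per_group=64 * 1024,    # buffer small batches before flushing
    max_rows_per_group=1024 * 1024,  # the default, written readably
)
```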
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +800,20 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
used determined by the number of available CPU cores.
max_partitions : int, default 1024
Maximum number of partitions any batch may be written into.
+ max_open_files : int, default 1024
+ Maximum number of number of files can be opened
Review comment:
I found this a little confusing. Could we add the full docstring from
the C++ docs?
```suggestion
Maximum number of files that can be open at a time.
If an attempt is made to open too many files then the least recently
used file will be closed. If this setting is set too low you may end up
fragmenting your data into many small files.
```
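For illustration, a sketch of the behavior this docstring describes: with many partition directories and a low cap, writers are closed LRU-style and the data can end up fragmented (path and sizes are placeholders; assumes the parameter lands as named in this PR):
```python
# Hypothetical sketch of max_open_files with a partitioned write,
# assuming the parameter lands as named in this PR. Paths and sizes
# are placeholders.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "x": range(10_000),
    "part": [i % 100 for i in range(10_000)],  # 100 partition values
})

ds.write_dataset(
    table,
    "/tmp/open_files_demo",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("part", pa.int64())]), flavor="hive"
    ),
    # With only 8 writers allowed open at once, the least recently used
    # file is closed as new partitions arrive, so each directory may end
    # up holding several small files.
    max_open_files=8,
)
```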