vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r826599846
##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when
reading and allow data skipping based on statistics, but very small groups can
cause metadata to be a significant portion of file size. Arrow's file writer
provides sensible defaults for group sizing in most cases.
+
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, there are a few parameters that can be
+important for optimizing the writes, such as the number of rows per file
+and the number of files open during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_open_files`` is set greater than 0 then this will limit the maximum
+number of files that can be left open. If an attempt is made to open too many
+files then the least recently used file will be closed. If this setting is set
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open
+by the scanner before hitting the default Linux limit of 1024. Modify this
+value depending on the nature of the write operations in your workload.
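+
+As a minimal sketch (the table contents, output path, partitioning scheme,
+and the limit of 512 are illustrative assumptions, not recommendations),
+limiting open files could look like:
+
+.. code-block:: python
+
+   import pyarrow as pa
+   import pyarrow.dataset as ds
+
+   # A small partitioned table; a real workload would write far more data.
+   table = pa.table({"year": [2020, 2021, 2021], "value": [1.0, 2.0, 3.0]})
+
+   ds.write_dataset(
+       table,
+       "example_dataset",
+       format="parquet",
+       partitioning=ds.partitioning(pa.schema([("year", pa.int64())]),
+                                    flavor="hive"),
+       # Close the least recently used file once 512 files are open.
+       max_open_files=512,
+   )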
+
+Another important configuration used in :meth:`write_dataset` is
+``max_rows_per_file``.
+
+Set the maximum number of rows written in each file with the
+``max_rows_per_file`` parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed
+to respect ``max_open_files``.
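+
+As a sketch (the 5,000-row cap and output path are arbitrary examples, and
+``max_rows_per_group`` is lowered alongside it so that a row group cannot
+exceed the per-file limit):
+
+.. code-block:: python
+
+   import pyarrow as pa
+   import pyarrow.dataset as ds
+
+   table = pa.table({"value": list(range(10_000))})
+
+   ds.write_dataset(
+       table,
+       "example_dataset_rows",
+       format="parquet",
+       # Split output so no single file holds more than 5,000 rows.
+       max_rows_per_file=5000,
+       # Keep row groups no larger than the per-file cap.
+       max_rows_per_group=5000,
+   )
+
+With 10,000 input rows this sketch would produce two files of 5,000 rows
+each.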
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, the volume of data written per row group can be
+configured. This matters in mini-batch settings where records arrive one
+batch at a time. The group size can be bounded with a minimum and a maximum
+parameter, as in the sketch below.
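+
+The parameter names below, ``min_rows_per_group`` and ``max_rows_per_group``
+of :meth:`write_dataset`, are assumed from this discussion; the thresholds
+are illustrative only:
+
+.. code-block:: python
+
+   import pyarrow as pa
+   import pyarrow.dataset as ds
+
+   table = pa.table({"value": list(range(100_000))})
+
+   ds.write_dataset(
+       table,
+       "example_dataset_groups",
+       format="parquet",
+       # Accumulate at least 2,048 rows before a group is written, and
+       # split any group that would exceed 8,192 rows.
+       min_rows_per_group=2048,
+       max_rows_per_group=8192,
+   )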
+
Review comment:
I guess we can think of logging use cases where online activity is
monitored in windows (window aggregations) and summaries are logged by
computing on those aggregated values. In such a scenario, depending on the
accuracy required for the computation (if it is a learning task) and the
desired performance trade-offs (execution time and memory), users should be
able to tune this parameter. This could make an interesting blog article if
we can demonstrate it.