This is an automated email from the ASF dual-hosted git repository.
alenka pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 433f79526b ARROW-17046: [Python] improve documentation of pyarrow.parquet.write_to_dataset function (#13591)
433f79526b is described below
commit 433f79526bd21cd2d6cc1832294f70ff4e6cce53
Author: Amir Khosroshahi <[email protected]>
AuthorDate: Thu Jul 21 04:22:33 2022 -0400
ARROW-17046: [Python] improve documentation of pyarrow.parquet.write_to_dataset function (#13591)
This patch is an attempt to make the documentation of the
`pyarrow.parquet.write_to_dataset` function clearer so that the user can easily learn:
- Which parameters are used by the new code path and which ones are used by
the legacy code path
- How kwargs are handled, that is, to which underlying function (the one that
`pyarrow.parquet.write_to_dataset` wraps on each code path) they are passed;
see the sketch after the sign-off below.
Authored-by: Amir Khosroshahi <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
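For readers unfamiliar with the two code paths, here is a minimal sketch (an editor's illustration, not part of the commit; the table contents, root paths, and compression choice are made up) of how the routing described above plays out:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# New code path (use_legacy_dataset=False, the default): built on
# pyarrow.dataset.write_dataset(). Extra kwargs such as `compression` are
# turned into Parquet write options via ParquetFileFormat.make_write_options().
pq.write_to_dataset(
    table,
    root_path="dataset_new",          # illustrative output directory
    partition_cols=["year"],
    compression="snappy",
)

# Legacy code path (use_legacy_dataset=True): built on parquet.write_table(),
# so kwargs are forwarded to write_table()/ParquetWriter.
# partition_filename_cb is only honoured on this path.
pq.write_to_dataset(
    table,
    root_path="dataset_legacy",       # illustrative output directory
    partition_cols=["year"],
    use_legacy_dataset=True,
    partition_filename_cb=lambda keys: "-".join(map(str, keys)) + ".parquet",
)
```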
---
python/pyarrow/parquet/__init__.py | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/python/pyarrow/parquet/__init__.py b/python/pyarrow/parquet/__init__.py
index 61f8cdf709..868a83f0eb 100644
--- a/python/pyarrow/parquet/__init__.py
+++ b/python/pyarrow/parquet/__init__.py
@@ -3018,7 +3018,8 @@ def write_to_dataset(table, root_path, partition_cols=None,
use_threads=None, file_visitor=None,
existing_data_behavior=None,
**kwargs):
- """Wrapper around parquet.write_table for writing a Table to
+ """Wrapper around dataset.write_dataset (when use_legacy_dataset=False) or
+ parquet.write_table (when use_legacy_dataset=True) for writing a Table to
Parquet format by partitions.
For each combination of partition columns and values,
a subdirectory is created in the following
@@ -3052,6 +3053,9 @@ def write_to_dataset(table, root_path, partition_cols=None,
A callback function that takes the partition key(s) as an argument
and allows you to override the partition filename. If nothing is
passed, the filename will consist of a uuid.
+ This option is only supported for use_legacy_dataset=True.
+ When use_legacy_dataset=None and this option is specified,
+ use_legacy_dataset will be set to True.
use_legacy_dataset : bool
Default is False. Set to True to use the legacy behaviour
(this option is deprecated, and the legacy implementation will be
@@ -3061,17 +3065,21 @@ def write_to_dataset(table, root_path, partition_cols=None,
use_threads : bool, default True
Write files in parallel. If enabled, then maximum parallelism will be
used, determined by the number of available CPU cores.
+ This option is only supported for use_legacy_dataset=False.
schema : Schema, optional
+ This option is only supported for use_legacy_dataset=False.
partitioning : Partitioning or list[str], optional
The partitioning scheme specified with the
``pyarrow.dataset.partitioning()`` function or a list of field names.
When providing a list of field names, you can use
``partitioning_flavor`` to drive which partitioning type should be
used.
+ This option is only supported for use_legacy_dataset=False.
basename_template : str, optional
A template string used to generate basenames of written data files.
The token '{i}' will be replaced with an automatically incremented
integer. If not specified, it defaults to "guid-{i}.parquet".
+ This option is only supported for use_legacy_dataset=False.
file_visitor : function
If set, this function will be called with a WrittenFile instance
for each file created during the call. This object will have both
@@ -3091,16 +3099,12 @@ def write_to_dataset(table, root_path, partition_cols=None,
def file_visitor(written_file):
visited_paths.append(written_file.path)
+ This option is only supported for use_legacy_dataset=False.
existing_data_behavior : 'overwrite_or_ignore' | 'error' | \
'delete_matching'
Controls how the dataset will handle data that already exists in
the destination. The default behaviour is 'overwrite_or_ignore'.
- Only used in the new code path using the new Arrow Dataset API
- (``use_legacy_dataset=False``). In case the legacy implementation
- is selected the parameter is ignored as the old implementation does
- not support it (only has the default behaviour).
-
'overwrite_or_ignore' will ignore any existing data and will
overwrite files with the same name as an output file. Other
existing files will be ignored. This behavior, in combination
@@ -3113,9 +3117,15 @@ def write_to_dataset(table, root_path, partition_cols=None,
dataset. The first time each partition directory is encountered
the entire directory will be deleted. This allows you to overwrite
old partitions completely.
+ This option is only supported for use_legacy_dataset=False.
**kwargs : dict,
- Additional kwargs for write_table function. See docstring for
- `write_table` or `ParquetWriter` for more information.
+ When use_legacy_dataset=False, used as additional kwargs for
+ `dataset.write_dataset` function (passed to
+ `ParquetFileFormat.make_write_options`). See the docstring
+ of `write_table` for the available options.
+ When use_legacy_dataset=True, used as additional kwargs for
+ `parquet.write_table` function (See docstring for `write_table`
+ or `ParquetWriter` for more information).
Using `metadata_collector` in kwargs allows one to collect the
file metadata instances of dataset pieces. The file paths in the
ColumnChunkMetaData will be set relative to `root_path`.
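As a closing usage note (an editor's sketch, not part of the patch; the paths and table are illustrative, and dropping the partition column from the sidecar schema reflects the fact that partition columns are not stored in the data files), the `metadata_collector` behaviour described above can be exercised like this:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

# Collect a FileMetaData instance for every file written to the dataset.
collected = []
pq.write_to_dataset(
    table,
    root_path="dataset_root",         # illustrative output directory
    partition_cols=["year"],
    metadata_collector=collected,
)

# The collected metadata carries file paths relative to root_path, so it can
# be combined into a standalone _metadata sidecar file.
pq.write_metadata(
    table.drop(["year"]).schema,      # partition columns are not in the files
    "dataset_root/_metadata",
    metadata_collector=collected,
)
```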