[
https://issues.apache.org/jira/browse/ARROW-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446969#comment-16446969
]
ASF GitHub Bot commented on ARROW-1858:
---------------------------------------
xhochy closed pull request #1925: ARROW-1858: [Python] Added documentation for
pq.write_dataset
URL: https://github.com/apache/arrow/pull/1925
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index b509f17bf..df3cc625c 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -195,7 +195,7 @@ These settings can also be set on a per-column basis:
pq.write_table(table, where, compression={'foo': 'snappy', 'bar': 'gzip'},
use_dictionary=['foo', 'bar'])
-Reading Multiples Files and Partitioned Datasets
+Partitioned Datasets (Multiple Files)
------------------------------------------------
Multiple Parquet files constitute a Parquet *dataset*. These may present in a
@@ -225,6 +225,36 @@ A dataset partitioned by year and month may look like on disk:
...
...
+Writing to Partitioned Datasets
+-------------------------------
+
+You can write a partitioned dataset for any ``pyarrow`` file system that is a
+file-store (e.g. local, HDFS, S3). The default behaviour when no filesystem is
+specified is to use the local filesystem.
+
+.. code-block:: python
+
+    # Local dataset write
+    pq.write_to_dataset(table, root_path='dataset_name',
+                        partition_cols=['one', 'two'])
+
+The root path in this case specifies the parent directory to which data will
+be saved. The partition columns are the column names by which to partition the
+dataset. Columns are partitioned in the order they are given. The partition
+splits are determined by the unique values in the partition columns.
+
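+As an illustration (a minimal sketch; the column names ``one`` and ``two`` and
+the sample values are hypothetical, not taken from the documentation above),
+partitioning by ``one`` and then ``two`` produces one directory level per
+partition column, with one directory per unique value:
+
+.. code-block:: python
+
+    import pyarrow as pa
+    import pyarrow.parquet as pq
+
+    # Hypothetical table with two partition columns and one data column
+    table = pa.Table.from_arrays(
+        [pa.array(['a', 'a', 'b']),   # partition column "one"
+         pa.array([1, 2, 2]),         # partition column "two"
+         pa.array([0.1, 0.2, 0.3])],  # data column
+        names=['one', 'two', 'value'])
+
+    # Resulting layout (Hive-style key=value directories, generated file names):
+    #   dataset_name/one=a/two=1/...
+    #   dataset_name/one=a/two=2/...
+    #   dataset_name/one=b/two=2/...
+    pq.write_to_dataset(table, root_path='dataset_name',
+                        partition_cols=['one', 'two'])
+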
+To use another filesystem you only need to add the filesystem parameter. The
+individual table writes are already wrapped using ``with`` statements, so the
+``pq.write_to_dataset`` function does not need to be.
+
+.. code-block:: python
+
+    # Remote file-system example
+    fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
+    pq.write_to_dataset(table, root_path='dataset_name',
+                        partition_cols=['one', 'two'], filesystem=fs)
+
+Compatibility Note: if using ``pq.write_to_dataset`` to create a table that
+will then be used by HIVE, the partition column values must be compatible with
+the allowed character set of the HIVE version you are running.
+
+Reading from Partitioned Datasets
+---------------------------------
+
The :class:`~.ParquetDataset` class accepts either a directory name or a list
of file paths, and can discover and infer some common partition structures,
such as those produced by Hive:
@@ -234,6 +264,18 @@ such as those produced by Hive:
dataset = pq.ParquetDataset('dataset_name/')
table = dataset.read()
+You can also use the convenience function ``read_table`` exposed by
+``pyarrow.parquet`` that avoids the need for an additional Dataset object
+creation step.
+
+.. code-block:: python
+
+    table = pq.read_table('dataset_name')
+
+Note: the partition columns in the original table will have their types
+converted to Arrow dictionary types (pandas categorical) on load. Ordering of
+partition columns is not preserved through the save/load process. If reading
+from a remote filesystem into a pandas dataframe you may need to run
+``sort_index`` to maintain row ordering (as long as the ``preserve_index``
+option was enabled on write).
+
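+For example (a minimal sketch; ``dataset_name`` refers to the hypothetical
+dataset written above, and the round-trip through pandas is illustrative),
+restoring the row order after a load might look like:
+
+.. code-block:: python
+
+    import pyarrow.parquet as pq
+
+    # Partition columns come back as dictionary-encoded (categorical) columns
+    table = pq.read_table('dataset_name')
+
+    # Restore the original row order via the preserved index
+    # (assumes preserve_index was enabled when the table was written)
+    df = table.to_pandas().sort_index()
+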
Using with Spark
----------------
@@ -263,11 +305,11 @@ a parquet file into a Pandas dataframe.
This is suitable for executing inside a Jupyter notebook running on a Python 3
kernel.
-Dependencies:
+Dependencies:
-* python 3.6.2
-* azure-storage 0.36.0
-* pyarrow 0.8.0
+* python 3.6.2
+* azure-storage 0.36.0
+* pyarrow 0.8.0
.. code-block:: python
@@ -295,3 +337,4 @@ Notes:
* The ``account_key`` can be found under ``Settings -> Access keys`` in the
Microsoft Azure portal for a given container
* The code above works for a container with private access, Lease State =
Available, Lease Status = Unlocked
* The parquet file was Blob Type = Block blob
+
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> [Python] Add documentation about parquet.write_to_dataset and related methods
> -----------------------------------------------------------------------------
>
> Key: ARROW-1858
> URL: https://issues.apache.org/jira/browse/ARROW-1858
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Labels: beginner, pull-request-available
> Fix For: 0.10.0
>
>
> {{pyarrow}} not only allows one to write to a single Parquet file; you can
> also write just the schema metadata for a full multi-file dataset. This
> dataset can also be automatically partitioned by one or more columns. At the
> moment, this functionality is not really visible in the documentation. You
> mainly find the API documentation for it, but we should have a small
> tutorial-like section that explains the differences and use cases for each of
> these functions.
> See also
> https://stackoverflow.com/questions/47482434/can-pyarrow-write-multiple-parquet-files-to-a-folder-like-fastparquets-file-sch
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)