jorisvandenbossche commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r764011810
##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
Directory partitioning also supports providing a full schema rather than inferring
types from file paths.
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table. There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+ dataset = ds.dataset("hive_partitioned", format="parquet")
Review comment:
But then also for writing? (Writing currently defaults to directory
partitioning.)
I think that would certainly give a smoother roundtrip experience. I am a bit
unsure about the change in behaviour, though: for reading it seems harmless,
since it will leave alone directories that don't match a hive scheme, but
changing writing from directory to hive might be a bigger change.
I can't immediately remember why we initially chose directory partitioning as
the default.
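
To make the "harmless for reading" point concrete, here is a minimal pure-Python sketch of the idea. It is not pyarrow's actual discovery code; `parse_hive_segments` is a hypothetical helper, and the file paths are made up for illustration. It assumes that hive detection only picks up directory segments shaped like `key=value` and ignores everything else:

```python
from pathlib import PurePosixPath


def parse_hive_segments(path):
    """Collect key/value pairs from directory segments shaped like key=value.

    Hypothetical helper for illustration only; values are kept as strings,
    with no type inference.
    """
    fields = {}
    # Only look at the directories, not the file name itself.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            fields[key] = value
    return fields


# A hive-style layout yields partition fields.
hive_fields = parse_hive_segments("year=2009/month=11/part-0.parquet")

# A directory-style layout has no key=value segments, so nothing is picked up.
plain_fields = parse_hive_segments("2009/11/part-0.parquet")
```

Under that assumption, a directory-partitioned dataset read with hive detection enabled would simply come back without partition columns, which is the same as reading it with no partitioning specified. The writing side is different, since switching the default would change the layout of the files produced.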