jorisvandenbossche commented on a change in pull request #47:
URL: https://github.com/apache/arrow-cookbook/pull/47#discussion_r694057733
##########
File path: python/source/io.rst
##########
@@ -217,6 +217,65 @@ provided to :func:`pyarrow.csv.read_csv` to drive
col1: int64
ChunkedArray = 0 .. 99
+Writing Partitioned Datasets
+============================
+
+When your dataset is big it usually makes sense to split it into
+multiple separate files. You can do this manually or use
+:func:`pyarrow.dataset.write_dataset` to let Arrow do the work
+of splitting the data into chunks for you.
+
+The ``partitioning`` argument allows you to tell
+:func:`pyarrow.dataset.write_dataset` which columns the data
+should be split by.
+
+For example, given 100 birthdays between 2000 and 2009:
+
+.. testcode::
+
+    import numpy.random
+    import pyarrow as pa
+
+    # numpy's randint upper bound is exclusive, so this draws
+    # days 1-31 and months 1-12
+    data = pa.table({"day": numpy.random.randint(1, 32, size=100),
+                     "month": numpy.random.randint(1, 13, size=100),
+                     "year": [2000 + x // 10 for x in range(100)]})
+
+Then we could partition the data by the year column so that it
+gets saved in 10 different files:
+
+.. testcode::
+
+    import pyarrow as pa
+    import pyarrow.dataset as ds
+
+    ds.write_dataset(data, "./partitioned", format="parquet",
+                     partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
+
+Arrow will partition datasets in subdirectories by default, which will
+result in 10 different directories named with the value of the partitioning
+column and with file containing the data partition inside:
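+
+.. testcode::
+
+    import os
+
+    # A sketch of inspecting the result; it assumes the default
+    # directory flavor, which names each subdirectory after the
+    # partition value itself (e.g. "2000" rather than "year=2000").
+    print(sorted(os.listdir("./partitioned")))
+
+.. testoutput::
+
+    ['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009']
+
+The partitioned tree can then be loaded back as a single dataset; a
+minimal sketch, assuming the same partitioning schema is passed so
+that the ``year`` column is reconstructed from the directory names:
+
+.. testcode::
+
+    dataset = ds.dataset("./partitioned",
+                         partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
+
+    # All 100 rows come back; "year" is recovered from the paths.
+    print(dataset.to_table().num_rows)
+
+.. testoutput::
+
+    100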
Review comment:
the "and with file containing the data partition inside" reads a bit
strange. Maybe something like "each with a file containing the subset of the
data for that partition"