[GitHub] [arrow] bkietz commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset

GitBox Tue, 20 Jul 2021 05:05:24 -0700


bkietz commented on a change in pull request #10693:
URL: https://github.com/apache/arrow/pull/10693#discussion_r672508680




##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,165 @@ is materialized as columns when reading the data and can 
be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually listing the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table 
using :func:`~Dataset.to_table`.  This is
+useful if the dataset is small or there is only a small amount of data that 
needs to
+be read.  The dataset API contains additional methods to read and process 
large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  
This
+method returns an iterator of record batches.  For example, we can use this 
method to
+calculate the average of a column without loading the entire column into 
memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    col2_sum = 0
+    count = 0
+    for batch in dataset.to_batches(columns=["col2"], 
filter=~ds.field("col2").is_null()):
+        col2_sum += pc.sum(batch.column("col2")).as_py()
+        count += batch.num_rows
+    mean_a = col2_sum/count
+
+Customizing the batch size
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An iterative read of a dataset is often called a "scan" of the dataset and 
pyarrow
+uses an object called a :class:`Scanner` to do this.  A Scanner is created for 
you
+automatically by the to_table and to_batches method of the dataset.  Any 
arguments
+you pass to these methods will be passed on to the Scanner constructor.
+
+One of those parameters is the ``batch_size``.  This controls the maximum size 
of the
+batches returned by the scanner.  Batches can still be smaller than the 
`batch_size`
+if the dataset consists of small files or those files themselves consist of 
small
+row groups.  For example, a parquet file with 10,000 rows per row group will 
yield
+batches with, at most, 10,000 rows unless the batch_size is set to a smaller 
value.
+
+The default batch size is one million rows and this is typically a good 
default but
+you may want to customize it if you are reading a large number of columns.
+
+Writing Datasets
+----------------
+
+The dataset API also simplifies writing data to a dataset using 
:func:`write_dataset` .  This can be useful when
+you want to partition your data or you need to write a large amount of data.  A
+basic dataset write is similar to writing a table except that you specify a 
directory
+instead of a filename.
+
+.. ipython:: python
+
+    base = pathlib.Path(tempfile.gettempdir())
+    dataset_root = base / "sample_dataset"
+    dataset_root.mkdir(exist_ok=True)
+
+    table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 
5})
+    ds.write_dataset(table, dataset_root, format="parquet")
+
+The above example will create a single file named part-0.parquet in our 
sample_dataset
+directory.
+
+.. warning::
+
+    If you run the example again it will replace the existing part-0.parquet 
file.
+    Appending files to an existing dataset is not currently supported by this 
API and
+    the output directory should be empty for predictable results.

Review comment:
       ```suggestion
       Appending files to an existing dataset requires specifying a new
       ``basename_template`` for each call to ``ds.write_dataset``
       to avoid overwrite.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] bkietz commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset

Reply via email to