lidavidm commented on a change in pull request #10693:
URL: https://github.com/apache/arrow/pull/10693#discussion_r667897191



##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    sum = 0

Review comment:
       nit: this shadows a builtin
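
A hedged sketch of the fix, renaming the accumulator so the builtin `sum` is not shadowed (same logic as the example under review; `total` is just an illustrative name):

```python
import pyarrow.compute as pc

# Sketch: same streaming average as in the docs, without shadowing `sum`.
total = 0
count = 0
for batch in dataset.to_batches(filter=~ds.field('col2').is_null()):
    total += pc.sum(batch.column('col2')).as_py()
    count += batch.num_rows
mean_a = total / count
```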

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    sum = 0
+    count = 0
+    for batch in dataset.to_batches(filter=~ds.field('col2').is_null()):
+        sum += pc.sum(batch.column('col2')).as_py()
+        count += batch.num_rows
+    mean_a = sum/count

Review comment:
       Though, wouldn't the 'right' way to do it be to select `mean('col2')` as 
an expression? (I suppose you can't yet do that though.)
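
For reference, if the single column fits in memory, a non-streaming alternative (a sketch, not the expression-based `mean('col2')` aggregation discussed above, which isn't available yet) is to project just that column and use `pyarrow.compute.mean`:

```python
import pyarrow.compute as pc

# Sketch: materialize only 'col2'; pc.mean skips nulls by default,
# so no explicit is_null filter is needed here.
col2 = dataset.to_table(columns=['col2'])['col2']
mean_a = pc.mean(col2).as_py()
```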

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python

Review comment:
       Do we actually want all these to be ipython blocks instead of code-block 
blocks? It does slow down building a bit.

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning

Review comment:
       We do sort by path: 
   
https://github.com/apache/arrow/blob/91f261fa9a7841fd914c5ed1d8e747fb4e510a5b/cpp/src/arrow/dataset/discovery.cc#L203-L204
   
   Though it states that's only for unit tests. However, I would personally be really
   surprised if data were truly returned in random order, and since we've had this
   behavior for a few releases now, I think users would also be surprised. Maybe we
   should instead guarantee lexicographic sort on paths?
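
Until a guarantee is documented, one way to get a deterministic order today is to list and sort the files manually and build the dataset from the explicit list, since the ordering guarantee above applies when a list of files is given (a sketch; `base_dir` is a placeholder path):

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Sketch: discover the files ourselves, sort lexicographically, and construct
# the dataset from the explicit (ordered) list of paths.
base_dir = "/path/to/dataset"  # placeholder
local = fs.LocalFileSystem()
infos = local.get_file_info(fs.FileSelector(base_dir, recursive=True))
paths = sorted(info.path for info in infos if info.is_file)
dataset = ds.dataset(paths, format="parquet")
```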

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    sum = 0
+    count = 0
+    for batch in dataset.to_batches(filter=~ds.field('col2').is_null()):
+        sum += pc.sum(batch.column('col2')).as_py()
+        count += batch.num_rows
+    mean_a = sum/count
+
+Customizing the batch size
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An iterative read of a dataset is often called a "scan" of the dataset and pyarrow
+uses an object called a :class:`Scanner` to do this.  A Scanner is created for you
+automatically by the to_table and to_batches method of the dataset.  Any arguments
+you pass to these methods will be passed on to the Scanner constructor.
+
+One of those parameters is the `batch_size`.  This controls the maximum size of the
+batches returned by the scanner.  Batches can still be smaller than the `batch_size`
+if the dataset consists of small files or those files themselves consist of small
+row groups.  For example, a parquet file with 10,000 rows per row group will yield
+batches with, at most, 10,000 rows unless the batch_size is set to a smaller value.
+
+The default batch size is one million rows and this is typically a good default but
+you may want to customize it if you are reading a large number of columns.
+
+Writing Datasets
+----------------
+
+The dataset API also simplifies writing data to a dataset.  This can be useful when
+you want to partition your data or you need to write a large amount of data.  A
+basic dataset write is similar to writing a table except that you specify a directory
+instead of a filename.
+
+.. ipython:: python
+
+    base = pathlib.Path(tempfile.gettempdir())
+    dataset_root = base / "sample_dataset"
+    dataset_root.mkdir(exist_ok=True)
+
+    table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 5})
+    ds.write_dataset(table, dataset_root, format="parquet")
+
+The above example will create a single file named part-0.parquet in our sample_dataset
+diretory.
+
+.. warning::
+
+    If you run the example again it will replace the existing part-0.parquet file.
+    Appending files to an existing dataset is not currently supported by this API and
+    the output directory should be empty for predictable results.
+
+Writing partitioned data
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A partitioning object can be used to specify how your output data should be partitioned.
+This uses the same kind of partitioning objects we used for reading datasets.  To write
+our above data out to a partitioned directory we only need to specify how we want the
+dataset to be partitioned.  For example:
+
+.. ipython:: python
+
+    part = ds.partitioning(
+        pa.schema([("c", pa.int16())]), flavor="hive"
+    )
+    ds.write_dataset(table, dataset_root, format="parquet", partitioning=part)
+
+This will create two files.  Half our data will be in the dataset_root/c=1 directory and
+the other half will be in the dataset_root/c=2 directory.
+
+Writing large amounts of data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The above examples wrote data from a table.  If you are writing a large amount of data
+you may not be able to load everything into a single in-memory table.  Fortunately, the
+write_dataset method also accepts an iterable of record batches.  This makes it really
+simple, for example, to repartition a large dataset without loading the entire dataset
+into memory:
+
+.. ipython:: python
+
+    old_part = ds.partitioning(
+        pa.schema([("c", pa.int16())]), flavor="hive"
+    )
+    new_part = ds.partitioning(
+        pa.schema([("c", pa.int16())]), flavor=None
+    )
+    input_dataset = ds.dataset(dataset_root, partitioning=old_part)
+    new_root = base / "repartitioned_dataset"
+    # A scanner can act as an iterator of record batches but you could also receive
+    # data from the network (e.g. via flight), from your own scanning, or from any
+    # other method that yields record batches.  In addition, you can pass a dataset
+    # into write_dataset directly but this method is useful if you want to customize
+    # the scanner (e.g. to filter the input dataset or set a maximum batch size)
+    scanner = input_dataset.scanner()
+
+    ds.write_dataset(scanner, new_root, format="parquet", partitioning=new_part)
+
+After the above example runs our data will be in dataset_root/1 and dataset_root/2
+directories.  You could also use this mechnaism to change which columns you are partitioned
+on as well.  This is useful when you expect to query your data in specific ways and
+you can utilize partitioning to reduce the amount of data you need to read.
+
+.. To add when ARROW-12364 is merged
+    Customizing & inspecting written files
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+    By default the dataset API will create files named "part-i.format" where "i" is a integer
+    generated during the write and "format" is the file format specified in the write_dataset
+    call.  For simple datasets it may be possible to know which files will be created but for
+    larger or partitioned datasets it is not so easy.  The ``file_visitor`` keyword can be used
+    to supply a visitor that will be called as each file is created:
+
+    .. ipython:: python
+
+        def file_visitor(written_file):
+            print(f"path={written_file.path}")
+            print(f"metadata={written_file.metadata}")
+        ds.write_dataset(table, dataset_root, format="parquet", partitioning=part,
+                        file_visitor=file_visitor)
+
+    This will allow you to collect the filenames that belong to the dataset and store them elsewhere
+    which can be useful when you want to avoid scanning directories the next time you need to read
+    the data.  It can also be used to generate the _metadata index file used by other tools such as
+    dask or spark to create an index of the dataset.
+
+Configuring format-specific parameters during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In addition to the common options shared by all formats there are also format specific options
+that are unique to a particular format.  For example, to allow truncated timestamps while writing
+parquet files:

Review comment:
       Is it worth linking to the options classes for each of the formats?
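
For context, the format-specific write options come from the format object itself; for Parquet a write might look roughly like this (a sketch; exactly which writer keywords are forwarded depends on the pyarrow version):

```python
import pyarrow.dataset as ds

# Sketch: build Parquet-specific write options and pass them via file_options.
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    coerce_timestamps="ms",             # assumed to be forwarded to the Parquet writer
    allow_truncated_timestamps=True,
)
ds.write_dataset(table, dataset_root, format=parquet_format,
                 file_options=write_options)
```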

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    sum = 0
+    count = 0
+    for batch in dataset.to_batches(filter=~ds.field('col2').is_null()):
+        sum += pc.sum(batch.column('col2')).as_py()
+        count += batch.num_rows
+    mean_a = sum/count
+
+Customizing the batch size
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An iterative read of a dataset is often called a "scan" of the dataset and pyarrow
+uses an object called a :class:`Scanner` to do this.  A Scanner is created for you
+automatically by the to_table and to_batches method of the dataset.  Any arguments
+you pass to these methods will be passed on to the Scanner constructor.
+
+One of those parameters is the `batch_size`.  This controls the maximum size of the
+batches returned by the scanner.  Batches can still be smaller than the `batch_size`
+if the dataset consists of small files or those files themselves consist of small
+row groups.  For example, a parquet file with 10,000 rows per row group will yield
+batches with, at most, 10,000 rows unless the batch_size is set to a smaller value.
+
+The default batch size is one million rows and this is typically a good default but
+you may want to customize it if you are reading a large number of columns.
+
+Writing Datasets
+----------------
+
+The dataset API also simplifies writing data to a dataset.  This can be useful when
+you want to partition your data or you need to write a large amount of data.  A
+basic dataset write is similar to writing a table except that you specify a directory
+instead of a filename.
+
+.. ipython:: python
+
+    base = pathlib.Path(tempfile.gettempdir())
+    dataset_root = base / "sample_dataset"
+    dataset_root.mkdir(exist_ok=True)
+
+    table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 5})
+    ds.write_dataset(table, dataset_root, format="parquet")
+
+The above example will create a single file named part-0.parquet in our sample_dataset
+diretory.
+
+.. warning::
+
+    If you run the example again it will replace the existing part-0.parquet file.
+    Appending files to an existing dataset is not currently supported by this API and
+    the output directory should be empty for predictable results.
+
+Writing partitioned data
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A partitioning object can be used to specify how your output data should be partitioned.
+This uses the same kind of partitioning objects we used for reading datasets.  To write
+our above data out to a partitioned directory we only need to specify how we want the
+dataset to be partitioned.  For example:
+
+.. ipython:: python
+
+    part = ds.partitioning(
+        pa.schema([("c", pa.int16())]), flavor="hive"
+    )
+    ds.write_dataset(table, dataset_root, format="parquet", partitioning=part)
+
+This will create two files.  Half our data will be in the dataset_root/c=1 directory and
+the other half will be in the dataset_root/c=2 directory.
+
+Writing large amounts of data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The above examples wrote data from a table.  If you are writing a large amount of data
+you may not be able to load everything into a single in-memory table.  Fortunately, the
+write_dataset method also accepts an iterable of record batches.  This makes it really
+simple, for example, to repartition a large dataset without loading the entire dataset
+into memory:
+
+.. ipython:: python
+
+    old_part = ds.partitioning(
+        pa.schema([("c", pa.int16())]), flavor="hive"
+    )
+    new_part = ds.partitioning(
+        pa.schema([("c", pa.int16())]), flavor=None
+    )
+    input_dataset = ds.dataset(dataset_root, partitioning=old_part)
+    new_root = base / "repartitioned_dataset"
+    # A scanner can act as an iterator of record batches but you could also receive
+    # data from the network (e.g. via flight), from your own scanning, or from any
+    # other method that yields record batches.  In addition, you can pass a dataset
+    # into write_dataset directly but this method is useful if you want to customize
+    # the scanner (e.g. to filter the input dataset or set a maximum batch size)
+    scanner = input_dataset.scanner()
+
+    ds.write_dataset(scanner, new_root, format="parquet", partitioning=new_part)
+
+After the above example runs our data will be in dataset_root/1 and dataset_root/2
+directories.  You could also use this mechnaism to change which columns you are partitioned
+on as well.  This is useful when you expect to query your data in specific ways and
+you can utilize partitioning to reduce the amount of data you need to read.
+
+.. To add when ARROW-12364 is merged
+    Customizing & inspecting written files
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+    By default the dataset API will create files named "part-i.format" where "i" is a integer
+    generated during the write and "format" is the file format specified in the write_dataset
+    call.  For simple datasets it may be possible to know which files will be created but for
+    larger or partitioned datasets it is not so easy.  The ``file_visitor`` keyword can be used
+    to supply a visitor that will be called as each file is created:
+
+    .. ipython:: python
+
+        def file_visitor(written_file):
+            print(f"path={written_file.path}")
+            print(f"metadata={written_file.metadata}")
+        ds.write_dataset(table, dataset_root, format="parquet", partitioning=part,
+                        file_visitor=file_visitor)
+
+    This will allow you to collect the filenames that belong to the dataset and store them elsewhere
+    which can be useful when you want to avoid scanning directories the next time you need to read
+    the data.  It can also be used to generate the _metadata index file used by other tools such as
+    dask or spark to create an index of the dataset.
+
+Configuring format-specific parameters during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In addition to the common options shared by all formats there are also format specific options
+that are unique to a particular format.  For example, to allow truncated timestamps while writing
+parquet files:

Review comment:
       ```suggestion
   Parquet files:
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    sum = 0
+    count = 0
+    for batch in dataset.to_batches(filter=~ds.field('col2').is_null()):
+        sum += pc.sum(batch.column('col2')).as_py()
+        count += batch.num_rows
+    mean_a = sum/count
+
+Customizing the batch size
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An iterative read of a dataset is often called a "scan" of the dataset and pyarrow
+uses an object called a :class:`Scanner` to do this.  A Scanner is created for you
+automatically by the to_table and to_batches method of the dataset.  Any arguments
+you pass to these methods will be passed on to the Scanner constructor.
+
+One of those parameters is the `batch_size`.  This controls the maximum size of the
+batches returned by the scanner.  Batches can still be smaller than the `batch_size`
+if the dataset consists of small files or those files themselves consist of small
+row groups.  For example, a parquet file with 10,000 rows per row group will yield
+batches with, at most, 10,000 rows unless the batch_size is set to a smaller value.
+
+The default batch size is one million rows and this is typically a good default but
+you may want to customize it if you are reading a large number of columns.
+
+Writing Datasets
+----------------
+
+The dataset API also simplifies writing data to a dataset.  This can be useful when
+you want to partition your data or you need to write a large amount of data.  A
+basic dataset write is similar to writing a table except that you specify a directory
+instead of a filename.
+
+.. ipython:: python
+
+    base = pathlib.Path(tempfile.gettempdir())
+    dataset_root = base / "sample_dataset"
+    dataset_root.mkdir(exist_ok=True)
+
+    table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 5})
+    ds.write_dataset(table, dataset_root, format="parquet")
+
+The above example will create a single file named part-0.parquet in our sample_dataset
+diretory.

Review comment:
       Is it worth noting IPC (Feather) and CSV are also supported?
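
If that gets mentioned, a one-line example per format might be enough (a sketch; the directory names are placeholders and CSV support depends on the pyarrow version):

```python
# Sketch: the same table written in other supported formats.
ds.write_dataset(table, base / "sample_dataset_ipc", format="ipc")   # Arrow IPC / Feather V2
ds.write_dataset(table, base / "sample_dataset_csv", format="csv")   # if available in this version
```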

##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    sum = 0
+    count = 0
+    for batch in dataset.to_batches(filter=~ds.field('col2').is_null()):
+        sum += pc.sum(batch.column('col2')).as_py()
+        count += batch.num_rows
+    mean_a = sum/count
+
+Customizing the batch size
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An iterative read of a dataset is often called a "scan" of the dataset and pyarrow
+uses an object called a :class:`Scanner` to do this.  A Scanner is created for you
+automatically by the to_table and to_batches method of the dataset.  Any arguments
+you pass to these methods will be passed on to the Scanner constructor.
+
+One of those parameters is the `batch_size`.  This controls the maximum size of the
+batches returned by the scanner.  Batches can still be smaller than the `batch_size`
+if the dataset consists of small files or those files themselves consist of small
+row groups.  For example, a parquet file with 10,000 rows per row group will yield
+batches with, at most, 10,000 rows unless the batch_size is set to a smaller value.
+
+The default batch size is one million rows and this is typically a good default but
+you may want to customize it if you are reading a large number of columns.
+
+Writing Datasets
+----------------
+
+The dataset API also simplifies writing data to a dataset.  This can be useful when
+you want to partition your data or you need to write a large amount of data.  A
+basic dataset write is similar to writing a table except that you specify a directory
+instead of a filename.
+
+.. ipython:: python
+
+    base = pathlib.Path(tempfile.gettempdir())
+    dataset_root = base / "sample_dataset"
+    dataset_root.mkdir(exist_ok=True)
+
+    table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 5})
+    ds.write_dataset(table, dataset_root, format="parquet")
+
+The above example will create a single file named part-0.parquet in our sample_dataset
+diretory.

Review comment:
       ```suggestion
   directory.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
