drin commented on a change in pull request #9810:
URL: https://github.com/apache/arrow/pull/9810#discussion_r610324994



##########
File path: docs/source/cpp/dataset.rst
##########
@@ -0,0 +1,381 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+================
+Tabular Datasets
+================
+
+.. seealso::
+   :doc:`Dataset API reference <api/dataset>`
+
+.. warning::
+
+    The ``arrow::dataset`` namespace is experimental, and a stable API
+    is not yet guaranteed.
+
+The Arrow Datasets library provides functionality to efficiently work with
+tabular, potentially larger-than-memory, and multi-file datasets:
+
+* A unified interface for different sources: supporting different file
+  formats (Parquet, Feather) and different file systems (local, cloud).
+* Discovery of sources (crawling directories, handling directory-based
+  partitioned datasets, basic schema normalization, ..)
+* Optimized reading with predicate pushdown (filtering rows), projection
+  (selecting columns), parallel reading, and fine-grained management of tasks.
+
+Currently, only Parquet, Feather / Arrow IPC, and CSV files are supported. The
+goal is to expand this in the future to other file formats and data sources
+(e.g. database connections).
+
+Reading Datasets
+----------------
+
+For the examples below, let's create a small dataset consisting
+of a directory with two Parquet files:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 50-85
+   :linenos:
+   :lineno-match:
+
+(See the full example at bottom: :ref:`cpp-dataset-full-example`.)
+
+Dataset discovery
+~~~~~~~~~~~~~~~~~
+
+A :class:`arrow::dataset::Dataset` object can be created using the various
+:class:`arrow::dataset::DatasetFactory` objects. Here, we'll use the
+:class:`arrow::dataset::FileSystemDatasetFactory`, which can create a dataset
+given the base directory path:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 151-170
+   :emphasize-lines: 6-11
+   :linenos:
+   :lineno-match:
+
+Here, we're also passing the filesystem to use and the file format to
+read. This lets us choose between, for example, reading local files or files
+in Amazon S3, or between Parquet and CSV.
+
+In addition to a base directory path, we can list file paths manually.
+
+Creating a :class:`arrow::dataset::Dataset` in this way loads nothing into
+memory; it only crawls the directory to find all the files
+(:func:`arrow::dataset::FileSystemDataset::files`):
+
+.. code-block:: cpp
+
+   for (const auto& filename : dataset->files()) {
+     std::cout << filename << std::endl;
+   }
+
+…and infers the dataset's schema (by default from the first file):
+
+.. code-block:: cpp
+
+   std::cout << dataset->schema()->ToString() << std::endl;
+
+Using the :func:`arrow::dataset::Dataset::NewScan` method, we can build a
+:class:`arrow::dataset::Scanner` and read the dataset (or a portion of it) into
+a :class:`arrow::Table` with the :func:`arrow::dataset::Scanner::ToTable`
+method:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 151-170
+   :emphasize-lines: 16-19
+   :linenos:
+   :lineno-match:
+
+.. TODO: iterative loading not documented pending API changes
+.. note:: Depending on the size of your dataset, this can require a lot of
+          memory; see :ref:`cpp-dataset-filtering-data` below on
+          filtering/projecting.
+
+Reading different file formats
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The above examples use Parquet files on local disk, but the Dataset API
+provides a consistent interface across multiple file formats and sources.
+Currently, Parquet, Feather / Arrow IPC, and CSV file formats are supported;
+more formats are planned in the future.
+
+If we save the table as a Feather file instead of Parquet files:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 87-104
+   :linenos:
+   :lineno-match:
+
+then we can read the Feather file using the same functions, but passing a
+:class:`arrow::dataset::IpcFileFormat`:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 310,323
+   :linenos:
+
+Customizing file formats
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+:class:`arrow::dataset::FileFormat` objects have properties that control how
+they are read. For example::

Review comment:
       I also think that since the member variable is `reader_options`, I'd prefer to see:
   
   `FileFormat objects have options that control read behavior`
   
   ...but only ParquetFileFormat has `reader_options`? Is this called something different in each `FileFormat` class?



