pitrou commented on code in PR #14018:
URL: https://github.com/apache/arrow/pull/14018#discussion_r1015788915


##########
cpp/src/parquet/arrow/reader.h:
##########
@@ -301,6 +321,11 @@ class PARQUET_EXPORT FileReaderBuilder {
                        const ReaderProperties& properties = default_reader_properties(),
                        std::shared_ptr<FileMetaData> metadata = NULLPTR);
 
+  /// Create FileReaderBuilder from Arrow file and optional properties / metadata

Review Comment:
   ```suggestion
     /// Create FileReaderBuilder from file path and optional properties / metadata
   ```



##########
cpp/src/parquet/arrow/writer.h:
##########
@@ -54,36 +57,69 @@ class PARQUET_EXPORT FileWriter {
                               std::shared_ptr<ArrowWriterProperties> arrow_properties,
                               std::unique_ptr<FileWriter>* out);
 
+  /// \brief Try to create an Arrow to Parquet file writer.
+  ///
+  /// \param schema schema of data that will be passed.
+  /// \param pool memory pool to use.
+  /// \param sink output stream to write Parquet data.
+  /// \param properties general Parquet writer properties.
+  /// \param arrow_properties Arrow-specific writer properties.
+  ///
+  /// \since 11.0.0
+  static ::arrow::Result<std::unique_ptr<FileWriter>> Open(
+      const ::arrow::Schema& schema, MemoryPool* pool,
+      std::shared_ptr<::arrow::io::OutputStream> sink,
+      std::shared_ptr<WriterProperties> properties = default_writer_properties(),
+      std::shared_ptr<ArrowWriterProperties> arrow_properties =
+          default_arrow_writer_properties());
+
+  ARROW_DEPRECATED("Deprecated in 11.0.0. Use result variants instead.")
   static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                               std::shared_ptr<::arrow::io::OutputStream> sink,
                               std::shared_ptr<WriterProperties> properties,
                               std::unique_ptr<FileWriter>* writer);
-
+  ARROW_DEPRECATED("Deprecated in 11.0.0. Use result variants instead.")

Review Comment:
   Same suggestions here.



##########
cpp/src/parquet/arrow/writer.h:
##########
@@ -98,9 +134,20 @@ ::arrow::Status WriteMetaDataFile(const FileMetaData& file_metadata,
                                   ::arrow::io::OutputStream* sink);
 
 /// \brief Write a Table to Parquet.
+///
+/// This writes one table in a single shot. To write a Parquet file with
+/// multiple tables iteratively, see parquet::arrow::FileWriter.
+///
+/// \param table Table to write.
+/// \param pool memory pool to use.
+/// \param sink output stream to write Parquet data.
+/// \param chunk_size maximum size of row groups to write.

Review Comment:
   Same question here.



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,310 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data into Arrow Tables and Record
+Batches.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+To read Parquet data into Arrow structures, use :class:`arrow::FileReader`.
+To construct one, you need a :class:`::arrow::io::RandomAccessFile` instance
+representing the input file. To read the whole file at once,
+use :func:`arrow::FileReader::ReadTable`:
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, which accepts the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`
+method to retrieve a :class:`arrow::RecordBatchReader`. It will use the batch 
+size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+.. seealso::
+
+   For reading multi-file datasets or pushing down filters to prune row groups,
+   see :ref:`Tabular Datasets<cpp-dataset>`.
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing (pre-buffering) to reduce the
+number of API calls:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+   arrow_reader_props.set_pre_buffer(true);
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then (a combined
+sketch follows the list):
+
+#. Do *not* turn on read coalescing (pre-buffering) in :class:`parquet::ArrowReaderProperties`.
+#. Read data in batches using :func:`arrow::FileReader::GetRecordBatchReader`.
+#. Turn on ``enable_buffered_stream`` in :class:`parquet::ReaderProperties`.
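+
+A minimal sketch combining these settings (``pool`` is assumed to be an
+:class:`arrow::MemoryPool*` and ``path`` the path to an existing Parquet file):
+
+.. code-block:: cpp
+
+   #include "parquet/arrow/reader.h"
+
+   // General Parquet settings: stream column data through a small buffer
+   // instead of loading entire column chunks into memory at once.
+   auto reader_properties = parquet::ReaderProperties(pool);
+   reader_properties.enable_buffered_stream();
+
+   // Arrow-specific settings; pre-buffering is left off (the default).
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+
+   parquet::arrow::FileReaderBuilder builder;
+   PARQUET_THROW_NOT_OK(builder.OpenFile(path, /*memory_map=*/false, reader_properties));
+   builder.memory_pool(pool);
+   builder.properties(arrow_reader_props);
+
+   std::unique_ptr<parquet::arrow::FileReader> reader;
+   PARQUET_ASSIGN_OR_THROW(reader, builder.Build());
+
+   // Iterate over the file as record batches instead of one large table.
+   std::shared_ptr<arrow::RecordBatchReader> rb_reader;
+   PARQUET_THROW_NOT_OK(reader->GetRecordBatchReader(&rb_reader));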
+
+In addition, if you know certain columns contain many repeated values, you can
+read them as :term:`dictionary encoded<dictionary-encoding>` columns. This is
+enabled with the ``set_read_dictionary`` setting on :class:`ArrowReaderProperties`.
+If the files were written with Arrow C++ and the ``store_schema`` option was
+activated, then the original Arrow schema will be automatically read and will
+override this setting.
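+
+For example, a sketch requesting dictionary encoding for the first column
+(columns are addressed by their index in the Parquet schema):
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+   // Return column 0 as an arrow::DictionaryArray rather than fully
+   // materializing its repeated values.
+   arrow_reader_props.set_read_dictionary(0, true);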
+
+StreamReader
+------------
+
+The :class:`StreamReader` allows for Parquet files to be read using
+standard C++ input operators, which ensures type safety.
+
+Please note that types must match the schema exactly, i.e. if the
+schema field is an unsigned 16-bit integer then you must supply a
+``uint16_t`` type.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to read a field by supplying an incorrect type.
+
+* Attempt to read beyond end of row.
+
+* Attempt to read beyond end of file.
+
+.. code-block:: cpp
+
+   #include "arrow/io/file.h"
+   #include "parquet/stream_reader.h"
+
+   {
+      std::shared_ptr<arrow::io::ReadableFile> infile;
+
+      PARQUET_ASSIGN_OR_THROW(
+         infile,
+         arrow::io::ReadableFile::Open("test.parquet"));
+
+      parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};
+
+      std::string article;
+      float price;
+      uint32_t quantity;
+
+      while ( !stream.eof() )
+      {
+         stream >> article >> price >> quantity >> parquet::EndRow;
+         // ...
+      }
+   }
+
+Writing Parquet files
+=====================
+
+WriteTable
+----------
+
+The :func:`arrow::WriteTable` function writes an entire
+:class:`::arrow::Table` to an output file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 19-21
+   :dedent: 2
+
+.. note::
+
+   Column compression is off by default in C++. See :ref:`below <parquet-writer-properties>`
+   for how to choose a compression codec in the writer properties.
+
+To write out data batch-by-batch, use :class:`arrow::FileWriter`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 23-25,32,36
+   :dedent: 2
+
+StreamWriter
+------------
+
+The :class:`StreamWriter` allows for Parquet files to be written using
+standard C++ output operators.  This type-safe approach also ensures
+that rows are written without omitting fields and allows for new row
+groups to be created automatically (after a certain volume of data) or
+explicitly by using the :type:`EndRowGroup` stream modifier.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to write a field using an incorrect type.
+
+* Attempt to write too many fields in a row.
+
+* Attempt to skip a required field.
+
+.. code-block:: cpp
+
+   #include "arrow/io/file.h"
+   #include "parquet/stream_writer.h"
+
+   {
+      std::shared_ptr<arrow::io::FileOutputStream> outfile;
+
+      PARQUET_ASSIGN_OR_THROW(
+         outfile,
+         arrow::io::FileOutputStream::Open("test.parquet"));
+
+      parquet::WriterProperties::Builder builder;
+      std::shared_ptr<parquet::schema::GroupNode> schema;
+
+      // Set up builder with required compression type etc.
+      // Define schema.
+      // ...
+
+      parquet::StreamWriter os{
+         parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};
+
+      // Loop over some data structure which provides the required
+      // fields to be written and write each row.
+      for (const auto& a : getArticles())
+      {
+         os << a.name() << a.price() << a.quantity() << parquet::EndRow;
+      }
+   }
+
+.. _parquet-writer-properties:
+
+Writer properties
+-----------------
+
+To configure how Parquet files are written, use the :class:`WriterProperties::Builder`:
+
+.. code-block:: cpp
+
+   #include "parquet/arrow/writer.h"
+   #include "arrow/util/type_fwd.h"
+
+   using parquet::WriterProperties;
+   using parquet::ParquetVersion;
+   using parquet::ParquetDataPageVersion;
+   using arrow::Compression;
+
+   std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
+      .max_row_group_length(64 * 1024)
+      .created_by("My Application")
+      .version(ParquetVersion::PARQUET_2_6)
+      .data_page_version(ParquetDataPageVersion::V2)
+      .compression(Compression::SNAPPY)
+      .build();
+
+The ``max_row_group_length`` sets an upper bound on the number of rows per row
+group that takes precedence over the ``chunk_size`` passed in the write methods.
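+
+For example, in the following sketch (``table`` and ``outfile`` are assumed to
+be set up as in the examples above) row groups are capped at 64 Ki rows, since
+``max_row_group_length`` is smaller than the requested ``chunk_size``:
+
+.. code-block:: cpp
+
+   std::shared_ptr<WriterProperties> props =
+       WriterProperties::Builder().max_row_group_length(64 * 1024)->build();
+
+   // The smaller of the two limits wins.
+   PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
+       *table, arrow::default_memory_pool(), outfile,
+       /*chunk_size=*/1024 * 1024, props));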
+
+You can set the version of Parquet to write with ``version``, which determines
+which logical types are available. In addition, you can set the data page version
+with ``data_page_version``. It's V1 by default; setting it to V2 will allow more
+optimal compression (skipping compressing pages where there isn't a space 
+benefit), but not all readers support this data page version.
+
+Compression is off by default, but to get the most out of Parquet, you should 
+also choose a compression codec. You can choose one for the whole file or 
+choose one for individual columns. If you choose a mix, the file-level option
+will apply to columns that don't have a specific compression codec. See 
+:class:`::arrow::Compression` for options.
+
+Column data encodings can likewise be applied at the file level or at the
+column level. By default, the writer will attempt to dictionary encode all 
+supported columns, unless the dictionary grows too large. This behavior can
+be changed at the file level or at the column level with ``disable_dictionary()``.
+When not using dictionary encoding, it will fall back to the encoding set for
+the column or the overall file; by default ``Encoding::PLAIN``, but this can
+be changed with ``encoding()``.
+
+.. code-block:: cpp
+
+   #include "parquet/arrow/writer.h"
+   #include "arrow/util/type_fwd.h"
+
+   using parquet::WriterProperties;
+   using arrow::Compression;
+   using parquet::Encoding;
+
+   std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
+     .compression(Compression::SNAPPY)        // Fallback
+     ->compression("colA", Compression::ZSTD) // Only applies to column "colA"
+     ->encoding(Encoding::BIT_PACKED)         // Fallback
+     ->encoding("colB", Encoding::RLE)        // Only applies to column "colB"
+     ->disable_dictionary("colB")             // Never dictionary-encode column "colB"
+     ->build();
+
+Statistics are enabled by default for all columns. You can disable statistics for
+all columns or specific columns using ``disable_statistics`` on the builder.
+There is a ``max_statistics_size`` which limits the maximum number of bytes that
+may be used for min and max values, useful for types like strings or binary blobs.
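+
+A short sketch of both knobs (``"colA"`` is a hypothetical column name):
+
+.. code-block:: cpp
+
+   std::shared_ptr<WriterProperties> props =
+       WriterProperties::Builder()
+           .disable_statistics("colA")  // no statistics for column "colA" only
+           ->max_statistics_size(256)   // cap min/max size where statistics are kept
+           ->build();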
+
+There are also Arrow-specific settings that can be configured with
+:class:`parquet::ArrowWriterProperties`:
+
+.. code-block:: cpp
+
+   #include "parquet/arrow/writer.h"
+
+   using parquet::ArrowWriterProperties;
+
+   std::shared_ptr<ArrowWriterProperties> arrow_props = ArrowWriterProperties::Builder()
+      .enable_deprecated_int96_timestamps() // default False
+      ->store_schema() // default False
+      ->enable_compliant_nested_types() // default False
+      ->build();
+
+These options mostly dictate how Arrow types are converted to Parquet types.
+Turning on ``store_schema`` will cause the writer to store the serialized Arrow
+schema within the file metadata. Since there is no bijection between Parquet
+schemas and Arrow schemas, storing the Arrow schema allows the Arrow reader
+to more faithfully recreate the original data. This mapping from Parquet types
+back to original Arrow types includes:
+
+* Reading timestamps with original timezone information (Parquet does not
+  support time zones);
+* Reading Arrow types from their storage types (such as Duration from int64
+  columns);
+* Reading string and binary columns back into large variants with 64-bit offsets;
+* Reading back columns as dictionary encoded (whether an Arrow column and a 
+  the serialized Parquet version are dictionary encoded are independent).

Review Comment:
   ```suggestion
   * Reading back columns as dictionary encoded (whether an Arrow column and
     the serialized Parquet version are dictionary encoded are independent).
   ```



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,310 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data into Arrow Tables and Record
+Batches.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for

Review Comment:
   I second this comment :-)



##########
cpp/src/parquet/arrow/writer.h:
##########
@@ -54,36 +57,69 @@ class PARQUET_EXPORT FileWriter {
                               std::shared_ptr<ArrowWriterProperties> arrow_properties,
                               std::unique_ptr<FileWriter>* out);
 
+  /// \brief Try to create an Arrow to Parquet file writer.
+  ///
+  /// \param schema schema of data that will be passed.
+  /// \param pool memory pool to use.
+  /// \param sink output stream to write Parquet data.
+  /// \param properties general Parquet writer properties.
+  /// \param arrow_properties Arrow-specific writer properties.
+  ///
+  /// \since 11.0.0
+  static ::arrow::Result<std::unique_ptr<FileWriter>> Open(
+      const ::arrow::Schema& schema, MemoryPool* pool,
+      std::shared_ptr<::arrow::io::OutputStream> sink,
+      std::shared_ptr<WriterProperties> properties = default_writer_properties(),
+      std::shared_ptr<ArrowWriterProperties> arrow_properties =
+          default_arrow_writer_properties());
+
+  ARROW_DEPRECATED("Deprecated in 11.0.0. Use result variants instead.")

Review Comment:
   ```suggestion
     ARROW_DEPRECATED("Deprecated in 11.0.0. Use Result-returning variants instead.")
   ```



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,310 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data into Arrow Tables and Record
+Batches.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+To read Parquet data into Arrow structures, use :class:`arrow::FileReader`.
+To construct one, you need a :class:`::arrow::io::RandomAccessFile` instance
+representing the input file. To read the whole file at once,
+use :func:`arrow::FileReader::ReadTable`:
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, which accepts the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`
+method to retrieve a :class:`arrow::RecordBatchReader`. It will use the batch 
+size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+.. seealso::
+
+   For reading multi-file datasets or pushing down filters to prune row groups,
+   see :ref:`Tabular Datasets<cpp-dataset>`.
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing (pre-buffering) to reduce the
+number of API calls:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+   arrow_reader_props.set_pre_buffer(true);
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then:
+
+#. Do *not* turn on read coalescing (pre-buffering) in :class:`parquet::ArrowReaderProperties`.
+#. Read data in batches using :func:`arrow::FileReader::GetRecordBatchReader`.
+#. Turn on ``enable_buffered_stream`` in :class:`parquet::ReaderProperties`.
+
+In addition, if you know certain columns contain many repeated values, you can
+read them as :term:`dictionary encoded<dictionary-encoding>` columns. This is
+enabled with the ``set_read_dictionary`` setting on :class:`ArrowReaderProperties`.
+If the files were written with Arrow C++ and the ``store_schema`` option was
+activated, then the original Arrow schema will be automatically read and will
+override this setting.
+
+StreamReader
+------------
+
+The :class:`StreamReader` allows for Parquet files to be read using
+standard C++ input operators, which ensures type safety.
+
+Please note that types must match the schema exactly, i.e. if the
+schema field is an unsigned 16-bit integer then you must supply a
+``uint16_t`` type.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to read a field by supplying an incorrect type.
+
+* Attempt to read beyond end of row.
+
+* Attempt to read beyond end of file.
+
+.. code-block:: cpp
+
+   #include "arrow/io/file.h"
+   #include "parquet/stream_reader.h"
+
+   {
+      std::shared_ptr<arrow::io::ReadableFile> infile;
+
+      PARQUET_ASSIGN_OR_THROW(
+         infile,
+         arrow::io::ReadableFile::Open("test.parquet"));
+
+      parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};
+
+      std::string article;
+      float price;
+      uint32_t quantity;
+
+      while ( !stream.eof() )
+      {
+         stream >> article >> price >> quantity >> parquet::EndRow;
+         // ...
+      }
+   }
+
+Writing Parquet files
+=====================
+
+WriteTable
+----------
+
+The :func:`arrow::WriteTable` function writes an entire
+:class:`::arrow::Table` to an output file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 19-21
+   :dedent: 2
+
+.. note::
+
+   Column compression is off by default in C++. See :ref:`below <parquet-writer-properties>`
+   for how to choose a compression codec in the writer properties.
+
+To write out data batch-by-batch, use :class:`arrow::FileWriter`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 23-25,32,36
+   :dedent: 2
+
+StreamWriter
+------------
+
+The :class:`StreamWriter` allows for Parquet files to be written using
+standard C++ output operators.  This type-safe approach also ensures
+that rows are written without omitting fields and allows for new row
+groups to be created automatically (after a certain volume of data) or
+explicitly by using the :type:`EndRowGroup` stream modifier.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to write a field using an incorrect type.
+
+* Attempt to write too many fields in a row.

Review Comment:
   Too few as well, or are the remaining fields treated as omitted?



##########
cpp/examples/arrow/parquet_read_write.cc:
##########
@@ -0,0 +1,189 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/api.h"
+#include "arrow/io/api.h"
+#include "arrow/result.h"
+#include "arrow/util/type_fwd.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/arrow/writer.h"
+
+#include <iostream>
+
+arrow::Status ReadFullFile(std::string path_to_file) {
+  // #include "arrow/io/api.h"
+  // #include "parquet/arrow/reader.h"
+
+  arrow::MemoryPool* pool = arrow::default_memory_pool();
+  std::shared_ptr<arrow::io::RandomAccessFile> input;
+  ARROW_ASSIGN_OR_RAISE(input, arrow::io::ReadableFile::Open(path_to_file));
+
+  // Open Parquet file reader
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, pool, &arrow_reader));
+
+  // Read entire file as a single Arrow table
+  std::shared_ptr<arrow::Table> table;
+  ARROW_RETURN_NOT_OK(arrow_reader->ReadTable(&table));
+  return arrow::Status::OK();
+}
+
+arrow::Status ReadInBatches(std::string path_to_file) {
+  // #include "arrow/io/api.h"
+  // #include "parquet/arrow/reader.h"
+
+  arrow::MemoryPool* pool = arrow::default_memory_pool();
+
+  // Configure general Parquet reader settings
+  auto reader_properties = parquet::ReaderProperties(pool);
+  reader_properties.set_buffer_size(4096 * 4);
+  reader_properties.enable_buffered_stream();
+
+  // Configure Arrow-specific Parquet reader settings
+  auto arrow_reader_props = parquet::ArrowReaderProperties();
+  arrow_reader_props.set_batch_size(128 * 1024);  // default 64 * 1024
+
+  parquet::arrow::FileReaderBuilder reader_builder;
+  ARROW_RETURN_NOT_OK(
+      reader_builder.OpenFile(path_to_file, /*memory_map=*/false, reader_properties));
+  reader_builder.memory_pool(pool);
+  reader_builder.properties(arrow_reader_props);
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());
+
+  std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
+  ARROW_RETURN_NOT_OK(arrow_reader->GetRecordBatchReader(&rb_reader));
+
+  for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *rb_reader) {
+    // Operate on each batch...
+  }
+  return arrow::Status::OK();
+}
+
+arrow::Result<std::shared_ptr<arrow::Table>> GetTable() {
+  auto builder = arrow::Int32Builder();
+
+  std::shared_ptr<arrow::Array> arr_x;
+  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 3, 5, 7, 1}));
+  ARROW_RETURN_NOT_OK(builder.Finish(&arr_x));
+
+  std::shared_ptr<arrow::Array> arr_y;
+  ARROW_RETURN_NOT_OK(builder.AppendValues({2, 4, 6, 8, 10}));
+  ARROW_RETURN_NOT_OK(builder.Finish(&arr_y));
+
+  auto schema = arrow::schema(
+      {arrow::field("x", arrow::int32()), arrow::field("y", arrow::int32())});
+
+  return arrow::Table::Make(schema, {arr_x, arr_y});
+}
+
+arrow::Result<std::shared_ptr<arrow::TableBatchReader>> GetRBR() {
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTable());
+  auto reader = std::make_shared<arrow::TableBatchReader>(table);
+  reader->set_chunksize(10);
+  return reader;
+}
+
+arrow::Status WriteFullFile(std::string path_to_file) {
+  // #include "parquet/arrow/writer.h"
+  // #include "arrow/util/type_fwd.h"
+  using parquet::ArrowWriterProperties;
+  using parquet::WriterProperties;
+
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTable());
+
+  // Choose compression
+  std::shared_ptr<WriterProperties> props =
+      WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();
+
+  // Opt to store Arrow schema for easier reads back into Arrow
+  std::shared_ptr<ArrowWriterProperties> arrow_props =
+      ArrowWriterProperties::Builder().store_schema()->build();
+
+  std::shared_ptr<arrow::io::FileOutputStream> outfile;
+  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
+
+  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table.get(),
+                                                 arrow::default_memory_pool(), outfile,
+                                                 /*chunk_size=*/3, props, arrow_props));
+  return arrow::Status::OK();
+}
+
+arrow::Status WriteInBatches(std::string path_to_file) {
+  // #include "parquet/arrow/writer.h"
+  // #include "arrow/util/type_fwd.h"
+  using parquet::ArrowWriterProperties;
+  using parquet::WriterProperties;
+
+  // Data is in RBR
+  std::shared_ptr<arrow::RecordBatchReader> batch_stream;
+  ARROW_ASSIGN_OR_RAISE(batch_stream, GetRBR());
+
+  // Choose compression
+  std::shared_ptr<WriterProperties> props =
+      WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();
+
+  // Opt to store Arrow schema for easier reads back into Arrow
+  std::shared_ptr<ArrowWriterProperties> arrow_props =
+      ArrowWriterProperties::Builder().store_schema()->build();
+
+  // Create a writer
+  std::shared_ptr<arrow::io::FileOutputStream> outfile;
+  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
+  std::unique_ptr<parquet::arrow::FileWriter> writer;
+  ARROW_ASSIGN_OR_RAISE(
+      writer, parquet::arrow::FileWriter::Open(*batch_stream->schema().get(),
+                                               arrow::default_memory_pool(), outfile,
+                                               props, arrow_props));
+
+  // Write each batch as a row_group
+  for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *batch_stream) {
+    ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
+    ARROW_ASSIGN_OR_RAISE(auto table,
+                          arrow::Table::FromRecordBatches(batch->schema(), {batch}));
+    ARROW_RETURN_NOT_OK(writer->WriteTable(*table.get(), batch->num_rows()));
+  }
+
+  // Write file footer and close
+  ARROW_RETURN_NOT_OK(writer->Close());
+
+  return arrow::Status::OK();
+}
+
+arrow::Status RunExamples(std::string path_to_file) {
+  ARROW_RETURN_NOT_OK(WriteFullFile(path_to_file));
+  ARROW_RETURN_NOT_OK(ReadFullFile(path_to_file));
+  ARROW_RETURN_NOT_OK(ReadInBatches(path_to_file));
+  return arrow::Status::OK();

Review Comment:
   `WriteInBatches` isn't exercised here; is that deliberate?



##########
cpp/src/parquet/arrow/writer.h:
##########
@@ -54,36 +57,69 @@ class PARQUET_EXPORT FileWriter {
                               std::shared_ptr<ArrowWriterProperties> arrow_properties,
                               std::unique_ptr<FileWriter>* out);
 
+  /// \brief Try to create an Arrow to Parquet file writer.
+  ///
+  /// \param schema schema of data that will be passed.
+  /// \param pool memory pool to use.
+  /// \param sink output stream to write Parquet data.
+  /// \param properties general Parquet writer properties.
+  /// \param arrow_properties Arrow-specific writer properties.
+  ///
+  /// \since 11.0.0
+  static ::arrow::Result<std::unique_ptr<FileWriter>> Open(
+      const ::arrow::Schema& schema, MemoryPool* pool,
+      std::shared_ptr<::arrow::io::OutputStream> sink,
+      std::shared_ptr<WriterProperties> properties = default_writer_properties(),
+      std::shared_ptr<ArrowWriterProperties> arrow_properties =
+          default_arrow_writer_properties());
+
+  ARROW_DEPRECATED("Deprecated in 11.0.0. Use result variants instead.")
   static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                               std::shared_ptr<::arrow::io::OutputStream> sink,
                               std::shared_ptr<WriterProperties> properties,
                               std::unique_ptr<FileWriter>* writer);
-
+  ARROW_DEPRECATED("Deprecated in 11.0.0. Use result variants instead.")
   static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                               std::shared_ptr<::arrow::io::OutputStream> sink,
                               std::shared_ptr<WriterProperties> properties,
                               std::shared_ptr<ArrowWriterProperties> arrow_properties,
                               std::unique_ptr<FileWriter>* writer);
 
+  /// Return the Arrow schema to be written.
   virtual std::shared_ptr<::arrow::Schema> schema() const = 0;
 
   /// \brief Write a Table to Parquet.
-  virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;
-
+  ///
+  /// \param table Arrow table to write.
+  /// \param chunk_size maximum size of row groups to write.

Review Comment:
   In rows or bytes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
