ksuarez1423 commented on code in PR #14018:
URL: https://github.com/apache/arrow/pull/14018#discussion_r982683335
##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.
 The Parquet C++ implementation is part of the Apache Arrow project and benefits from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row. This approach is
+offered for ease of use and type-safety. It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`.
+It will use the batch size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing to reduce number of API calls:
+
+.. code-block:: cpp
+
+   auto reader_properties = parquet::ReaderProperties(pool);
+   reader_properties.enable_buffered_stream();
+   reader_properties.set_buffer_size(4096 * 4); // This is default value
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then:
+
+#. Do not turn on read coalescing (pre-buffering).

Review Comment:
   No, I meant the "(pre-buffering)", the literal text; that you could drop the bit in parentheses if you wanted, if you added it to the previous "read coalescing" introduction.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
