Radu Teodorescu created ARROW-11465:
---------------------------------------
Summary: Parquet file writer snapshot API and proper
ColumnChunk.file_path utilization
Key: ARROW-11465
URL: https://issues.apache.org/jira/browse/ARROW-11465
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 3.0.0
Reporter: Radu Teodorescu
Assignee: Radu Teodorescu
Fix For: 4.0.0
This is a follow up to the thread:
[https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%[email protected]%3e]
The specific use case I am targeting is the ability to partially read a
Parquet file while it is still being written.
This is relevant for any process that records events over a long period of
time and writes them to Parquet (tracing data, logging events, or any other
live time series).
The solution relies on the fact that the Parquet specification allows column
chunk metadata to point explicitly to its location in a file, which can in
principle be different from the file containing the metadata (as covered in
other threads, this behavior is not fully supported by major Parquet
implementations).
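To make the file_path semantics concrete, here is a minimal sketch that uses only the standard library. {{ColumnChunkLocation}} and {{DataFileFor}} are hypothetical names chosen for illustration, not parquet-cpp types; the struct only mirrors the optional {{file_path}} and the {{file_offset}} fields of the Parquet ColumnChunk metadata.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the two ColumnChunk metadata fields relevant here.
struct ColumnChunkLocation {
  std::optional<std::string> file_path;  // unset: data is in the metadata file
  std::int64_t file_offset = 0;          // byte offset of the chunk in that file
};

// Returns the name of the file that holds the chunk's data: the explicit
// file_path when it is set, otherwise the file the metadata was read from.
std::string DataFileFor(const ColumnChunkLocation& chunk,
                        const std::string& metadata_file) {
  assert(!metadata_file.empty());
  return chunk.file_path.value_or(metadata_file);
}
```

With {{file_path}} unset the reader stays in the metadata file, which is the only case most implementations handle today; with it set, the data and the metadata can live in different files.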
My solution is centered around adding a method,
{{void ParquetFileWriter::Snapshot(const std::string& data_path,
std::shared_ptr<::arrow::io::OutputStream>&
sink) }}
which writes the metadata for the RowGroups written so far to the {{sink}}
stream and updates the {{file_path}} of every ColumnChunk's metadata to point
to {{data_path}}. This was intended as a minimalist change to
{{ParquetFileWriter}}.
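The intended snapshot semantics can be sketched with toy types. This is not the PR code: {{ToyWriter}}, {{ToyRowGroup}} and {{ToyColumnChunk}} are hypothetical stand-ins, and the snapshot is returned as a value instead of being serialized to an output stream.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; not the real parquet-cpp classes.
struct ToyColumnChunk {
  std::string file_path;    // empty: same file as the metadata
  std::int64_t offset = 0;  // byte offset of the chunk in the data file
};

struct ToyRowGroup {
  std::vector<ToyColumnChunk> columns;
};

class ToyWriter {
 public:
  void AppendRowGroup(ToyRowGroup row_group) {
    row_groups_.push_back(std::move(row_group));
  }

  // Copies the metadata of the row groups appended so far and points every
  // column chunk in the copy at data_path. The writer's own state is not
  // modified, so writing can continue after the snapshot is taken.
  std::vector<ToyRowGroup> Snapshot(const std::string& data_path) const {
    assert(!data_path.empty());
    std::vector<ToyRowGroup> snapshot = row_groups_;
    for (ToyRowGroup& row_group : snapshot) {
      for (ToyColumnChunk& column : row_group.columns) {
        column.file_path = data_path;
      }
    }
    return snapshot;
  }

 private:
  std::vector<ToyRowGroup> row_groups_;
};
```

Each snapshot describes only the row groups that were complete when it was taken, so a reader holding an older snapshot never looks past the data that was already flushed.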
On the reading side I implemented full support for ColumnChunk.file_path by
introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in
the {{ParquetFileReader}} implementation stack. In the PR implementation one
can default to the current behavior by using the {{SingleFile}} class, get
full read support for multi-file Parquet in line with the specification by
using the {{MultiReadableFile}} implementation (which captures the metadata
file's base directory and uses it as the base directory for
ColumnChunk.file_path), or provide a separate implementation for non-POSIX
file system storage.
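The base-directory resolution described for {{MultiReadableFile}} might look roughly like the sketch below. {{ResolveChunkPath}} is a hypothetical helper, not part of the PR. It assumes a POSIX-style local file system and C++17 {{std::filesystem}}, and the handling of absolute paths is my own assumption.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: maps a ColumnChunk.file_path to the file to open.
// An empty file_path means the data is in the metadata file itself; a
// relative file_path is resolved against the metadata file's directory.
fs::path ResolveChunkPath(const fs::path& metadata_file,
                          const std::string& chunk_file_path) {
  assert(!metadata_file.empty());
  if (chunk_file_path.empty()) return metadata_file;
  const fs::path chunk(chunk_file_path);
  if (chunk.is_absolute()) return chunk;  // assumption: absolute paths pass through
  return (metadata_file.parent_path() / chunk).lexically_normal();
}
```

A non-POSIX storage backend would replace this function with its own lookup, which is the role a separate {{ArrowMultiInputFile}} implementation would play.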
For an example, see the {{write_parquet_file_with_snapshot}} function in
reader-writer.cc, which illustrates the snapshotting write; the
{{read_whole_file}} function has been modified to read one of the snapshots (I
will roll back that change and provide a separate example before the merge).