[jira] [Created] (ARROW-11465) Parquet file writer snapshot API and proper ColumnChunk.file_path utilization

Radu Teodorescu (Jira) Mon, 01 Feb 2021 14:23:05 -0800

Radu Teodorescu created ARROW-11465:
---------------------------------------


             Summary: Parquet file writer snapshot API and proper 
ColumnChunk.file_path utilization
                 Key: ARROW-11465
                 URL: https://issues.apache.org/jira/browse/ARROW-11465
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 3.0.0
            Reporter: Radu Teodorescu
            Assignee: Radu Teodorescu
             Fix For: 4.0.0


This is a follow-up to the thread:
[https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%[email protected]%3e]

The specific use case I am targeting is the ability to partially read a 
Parquet file while it is still being written to.
This is relevant for any process that records events over a long period of 
time and writes them to Parquet (tracing data, logging events or any other 
live time series).
The solution relies on the fact that the Parquet specification allows column 
chunk metadata to point explicitly to its location in a file, which can 
theoretically be different from the file containing the metadata (as covered 
in other threads, this behavior is not fully supported by major Parquet 
implementations).
My solution is centered around adding a method,

{{void ParquetFileWriter::Snapshot(const std::string& data_path,
                                 std::shared_ptr<::arrow::io::OutputStream>& 
sink) }}

that writes the metadata for the RowGroups written so far to the {{sink}} 
stream and updates the {{file_path}} of all ColumnChunk metadata to point to 
{{data_path}}. This was intended as a minimalist change to 
{{ParquetFileWriter}}.
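
To make the intended semantics concrete, here is a minimal, self-contained 
sketch that does not use Arrow at all. Every name in it ({{ToyWriter}}, 
{{ToyFileMetadata}}, {{ToyColumnChunk}} and so on) is hypothetical and exists 
only for illustration; it is not the PR's implementation. It assumes that a 
snapshot is a copy of the metadata for the row groups recorded so far, with 
each column chunk's {{file_path}} set to {{data_path}}, and that the writer's 
own metadata is left untouched so writing can continue.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-ins for Parquet metadata (hypothetical, not the Arrow classes).
struct ToyColumnChunk {
  std::string file_path;    // empty means "same file as the metadata"
  std::int64_t offset = 0;  // byte offset of the chunk in the data file
  std::int64_t length = 0;  // byte length of the chunk
};

struct ToyRowGroup {
  std::vector<ToyColumnChunk> columns;
};

struct ToyFileMetadata {
  std::vector<ToyRowGroup> row_groups;
};

// Toy writer: tracks where each column chunk would land in the data file.
class ToyWriter {
 public:
  // Records one row group whose column chunks have the given byte lengths.
  void AppendRowGroup(const std::vector<std::int64_t>& chunk_lengths) {
    ToyRowGroup row_group;
    for (std::int64_t length : chunk_lengths) {
      assert(length >= 0);
      ToyColumnChunk chunk;
      chunk.offset = next_offset_;
      chunk.length = length;
      row_group.columns.push_back(chunk);
      next_offset_ += length;
    }
    metadata_.row_groups.push_back(row_group);
  }

  // Assumed snapshot semantics: copy the metadata for the row groups written
  // so far and point every column chunk at data_path. The writer's own
  // metadata is not modified, so more row groups can be appended afterwards.
  ToyFileMetadata Snapshot(const std::string& data_path) const {
    ToyFileMetadata snapshot = metadata_;
    for (ToyRowGroup& row_group : snapshot.row_groups) {
      for (ToyColumnChunk& chunk : row_group.columns) {
        chunk.file_path = data_path;
      }
    }
    return snapshot;
  }

  const ToyFileMetadata& metadata() const { return metadata_; }

 private:
  ToyFileMetadata metadata_;
  std::int64_t next_offset_ = 4;  // data starts after the 4-byte "PAR1" magic
};
```

In this toy model a reader holding an earlier snapshot sees only the row 
groups that existed when the snapshot was taken, while later snapshots of the 
same data file expose more of them.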

On the reading side I implemented full support for ColumnChunk.file_path by 
introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in 
the {{ParquetFileReader}} implementation stack. In the PR implementation one 
can default to the current behavior by using the {{SingleFile}} class, get 
full read support for multi-file Parquet, in line with the specification, by 
using the {{MultiReadableFile}} implementation (which captures the metadata 
file's base directory and uses it as the base directory for 
ColumnChunk.file_path), or provide a separate implementation for non-POSIX 
file system storage.
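
The base-directory rule described above can be sketched with 
{{std::filesystem}} (C++17). The helper name {{ResolveChunkPath}} and the 
example paths are hypothetical and are not taken from the PR; the sketch only 
assumes that an empty {{file_path}} means the data lives in the metadata file 
itself and that a non-empty one is resolved relative to the directory 
containing the metadata file.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: returns the file that holds a column chunk's data,
// given the path of the metadata file and the chunk's file_path value.
fs::path ResolveChunkPath(const fs::path& metadata_file,
                          const std::string& chunk_file_path) {
  if (chunk_file_path.empty()) {
    // No file_path set: the data is in the same file as the metadata.
    return metadata_file;
  }
  // Otherwise interpret file_path relative to the metadata file's directory
  // and collapse any "." or ".." components without touching the disk.
  return (metadata_file.parent_path() / chunk_file_path).lexically_normal();
}
```

For example, with a metadata file at {{/data/events/snapshot.meta}} and a 
{{file_path}} of {{events.parquet}}, this helper yields 
{{/data/events/events.parquet}} on a POSIX file system. A non-POSIX storage 
backend would need its own resolution rule, which is why the proposal leaves 
room for a separate implementation.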

For an example, see the {{write_parquet_file_with_snapshot}} function in 
reader-writer.cc, which illustrates the snapshotting write, while the 
{{read_whole_file}} function has been modified to read one of the snapshots (I 
will roll back that change and provide a separate example before the merge).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
