[
https://issues.apache.org/jira/browse/ARROW-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alessandro Molina updated ARROW-11465:
--------------------------------------
Fix Version/s: (was: 6.0.0)
7.0.0
> [C++] Parquet file writer snapshot API and proper ColumnChunk.file_path
> utilization
> -----------------------------------------------------------------------------------
>
> Key: ARROW-11465
> URL: https://issues.apache.org/jira/browse/ARROW-11465
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 3.0.0
> Reporter: Radu Teodorescu
> Assignee: Radu Teodorescu
> Priority: Major
> Labels: pull-request-available
> Fix For: 7.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is a follow up to the thread:
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%[email protected]%3e]
> The specific use case I am targeting is having the ability to partially read
> a parquet file while it is still being written to.
> This is relevant for any process that records events over a long period of
> time and writes them to parquet (tracing data, logging events, or any other
> live time series).
> The solution relies on the fact that the parquet specification allows column
> chunk metadata to point explicitly to the file holding the chunk's data,
> which can theoretically be different from the file containing the metadata
> (as covered in other threads, this behavior is not fully supported by the
> major parquet implementations).
> My solution is centered around adding a method,
>
> {{void ParquetFileWriter::Snapshot(const std::string& data_path, std::shared_ptr<::arrow::io::OutputStream>& sink)}}
>
> that writes the metadata for the RowGroups written so far to the {{sink}}
> stream and updates every ColumnChunk metadata {{file_path}} to point to
> {{data_path}}. This was intended as a minimalist change to
> {{ParquetFileWriter}}.
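> A minimal sketch of how the snapshotting write might look; the event schema
> and file names are illustrative, and {{Snapshot}} is the API proposed here,
> not part of a released Arrow version:
> {code:cpp}
> #include <memory>
> #include <vector>
>
> #include <arrow/io/file.h>
> #include <parquet/api/writer.h>
>
> int main() {
>   using parquet::schema::GroupNode;
>   using parquet::schema::PrimitiveNode;
>
>   // Illustrative single-column schema for a stream of timestamped events.
>   auto schema = std::static_pointer_cast<GroupNode>(GroupNode::Make(
>       "events", parquet::Repetition::REQUIRED,
>       {PrimitiveNode::Make("ts", parquet::Repetition::REQUIRED,
>                            parquet::Type::INT64)}));
>
>   // Long-lived data file that keeps growing as row groups are appended.
>   std::shared_ptr<arrow::io::OutputStream> data_sink;
>   PARQUET_ASSIGN_OR_THROW(
>       data_sink, arrow::io::FileOutputStream::Open("events.data.parquet"));
>   auto writer = parquet::ParquetFileWriter::Open(data_sink, schema);
>
>   // Append row groups as events arrive.
>   auto* rg = writer->AppendRowGroup();
>   auto* ts_writer = static_cast<parquet::Int64Writer*>(rg->NextColumn());
>   std::vector<int64_t> timestamps = {1, 2, 3};
>   ts_writer->WriteBatch(static_cast<int64_t>(timestamps.size()),
>                         /*def_levels=*/nullptr, /*rep_levels=*/nullptr,
>                         timestamps.data());
>
>   // Periodically publish a consistent view of everything written so far:
>   // the snapshot file holds the metadata, and every ColumnChunk.file_path
>   // in it points back at the data file.
>   std::shared_ptr<::arrow::io::OutputStream> snapshot_sink;
>   PARQUET_ASSIGN_OR_THROW(
>       snapshot_sink,
>       arrow::io::FileOutputStream::Open("events.snapshot.parquet"));
>   writer->Snapshot("events.data.parquet", snapshot_sink);  // proposed API
>
>   writer->Close();
>   return 0;
> }
> {code}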
> On the reading side I implemented full support for ColumnChunk.file_path by
> introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}}
> in the {{ParquetFileReader}} implementation stack. In the PR implementation
> one can default to the current behavior by using the {{SingleFile}} class,
> get full read support for multi-file parquet in line with the specification
> by using the {{MultiReadableFile}} implementation (which captures the
> metadata file's base directory and resolves each ColumnChunk.file_path
> relative to it), or provide a separate implementation for non-POSIX file
> system storage.
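> A rough sketch of inspecting one of the snapshots on the reading side; the
> file names follow the writer sketch above. The metadata inspection below
> uses only the existing reader API, while actually reading the chunk data out
> of the data file is what the {{MultiReadableFile}} path in this PR enables:
> {code:cpp}
> #include <iostream>
> #include <memory>
>
> #include <parquet/api/reader.h>
>
> int main() {
>   // Open the snapshot produced above. With the current (SingleFile)
>   // behavior the reader expects the column data in the same file; under
>   // this proposal a MultiReadableFile-backed reader resolves each
>   // ColumnChunk.file_path relative to the snapshot's base directory.
>   std::unique_ptr<parquet::ParquetFileReader> reader =
>       parquet::ParquetFileReader::OpenFile("events.snapshot.parquet");
>
>   std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
>   std::cout << "row groups visible in snapshot: "
>             << metadata->num_row_groups() << std::endl;
>
>   // Each column chunk records which file actually holds its data.
>   for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
>     auto row_group = metadata->RowGroup(rg);
>     for (int c = 0; c < row_group->num_columns(); ++c) {
>       std::cout << "row group " << rg << ", column " << c << " stored in "
>                 << row_group->ColumnChunk(c)->file_path() << std::endl;
>     }
>   }
>   return 0;
> }
> {code}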
> For an example, see the {{write_parquet_file_with_snapshot}} function in
> reader-writer.cc, which illustrates the snapshotting write, while the
> {{read_whole_file}} function has been modified to read one of the snapshots
> (I will roll back that change and provide a separate example before the
> merge).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)