raduteo opened a new pull request #8130:
URL: https://github.com/apache/arrow/pull/8130


   This is a follow-up to the thread:
   
https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%[email protected]%3e
   
   The specific use case I am targeting is the ability to partially read 
a Parquet file while it is still being written.
   This is relevant for any process that records events over a long period 
of time and writes them to Parquet (tracing data, logging events, or any other 
live time series).
   The solution relies on the fact that the Parquet specification allows column 
chunk metadata to point explicitly to its location in a file, which can 
theoretically be different from the file containing the metadata (as covered in 
other threads, this behavior is not fully supported by major Parquet 
implementations).
   My solution is centered around adding a method,
   ```
   void ParquetFileWriter::Snapshot(const std::string& data_path,
                                    std::shared_ptr<::arrow::io::OutputStream>& sink)
   ```
   that writes the metadata for the RowGroups written so far to the 
`sink` stream and updates the `file_path` of every ColumnChunk metadata entry to 
point to `data_path`. This was intended as a minimalist change to `ParquetFileWriter`.
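
   To make the snapshot semantics concrete, here is a minimal toy model of the idea. It does not use the Arrow or Parquet C++ types: `ToyWriter`, `ToyFileMetadata`, `ToyRowGroup` and `ToyColumnChunk` are hypothetical stand-ins, and the snapshot is returned by value rather than serialized to a `sink`. It only sketches the bookkeeping described above, under those assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-ins for Parquet metadata. These are NOT the Arrow/Parquet C++ types.
struct ToyColumnChunk {
  std::string file_path;  // empty means "same file as the metadata"
  int64_t file_offset;    // where the chunk's bytes start in the data file
  int64_t total_size;     // number of bytes in the chunk
};

struct ToyRowGroup {
  std::vector<ToyColumnChunk> columns;
  int64_t num_rows;
};

struct ToyFileMetadata {
  std::vector<ToyRowGroup> row_groups;
  int64_t num_rows = 0;
};

// Hypothetical writer that only tracks metadata, not actual column data.
class ToyWriter {
 public:
  // Records one finished row group whose chunks have the given byte sizes.
  void AppendRowGroup(const std::vector<int64_t>& chunk_sizes, int64_t num_rows) {
    assert(num_rows >= 0);
    ToyRowGroup rg;
    rg.num_rows = num_rows;
    for (int64_t size : chunk_sizes) {
      rg.columns.push_back(ToyColumnChunk{"", next_offset_, size});
      next_offset_ += size;
    }
    metadata_.row_groups.push_back(rg);
    metadata_.num_rows += num_rows;
  }

  // Returns a copy of the metadata for the row groups recorded so far, with
  // every chunk's file_path pointing at data_path. The writer is unchanged,
  // so it can keep appending row groups afterwards.
  ToyFileMetadata Snapshot(const std::string& data_path) const {
    ToyFileMetadata snap = metadata_;
    for (auto& rg : snap.row_groups) {
      for (auto& col : rg.columns) col.file_path = data_path;
    }
    return snap;
  }

 private:
  ToyFileMetadata metadata_;
  int64_t next_offset_ = 4;  // data starts after the 4-byte "PAR1" magic
};
```

   In this model a reader holding an earlier snapshot sees only the row groups that existed when it was taken, while the writer continues appending to the same data file.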
   
   On the reading side I implemented full support for ColumnChunk.file_path by 
introducing `ArrowMultiInputFile` as an alternative to `ArrowInputFile` in the 
`ParquetFileReader` implementation stack. In the PR implementation one can 
default to the current behavior by using the `SingleFile` class, get full read 
support for multi-file Parquet in line with the specification by using the 
`MultiReadableFile` implementation (which captures the metafile's base directory 
and uses it as the base directory for ColumnChunk.file_path), or 
provide a separate implementation for non-POSIX file system storage. 
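
   The base-directory resolution described above can be sketched as follows. `ResolveChunkPath` is a hypothetical helper written for illustration, not the PR's `MultiReadableFile` code; it assumes a local file system and C++17 `std::filesystem`, and it treats an empty `file_path` as "same file as the metadata".

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of Arrow): maps a ColumnChunk.file_path to
// the file that should be opened, given the location of the metadata file.
fs::path ResolveChunkPath(const fs::path& metadata_file,
                          const std::string& chunk_file_path) {
  assert(!metadata_file.empty());
  // An unset file_path means the chunk lives in the metadata file itself.
  if (chunk_file_path.empty()) return metadata_file;
  // Otherwise interpret it relative to the metadata file's directory and
  // collapse any "." or ".." components purely lexically (no disk access).
  return (metadata_file.parent_path() / chunk_file_path).lexically_normal();
}
```

   For example, with the metadata at `/data/run1/_metadata`, a chunk whose `file_path` is `part-0.parquet` would be read from `/data/run1/part-0.parquet`. A non-POSIX store would need its own resolution rule in place of this one.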
   
   For an example, see the `write_parquet_file_with_snapshot` function in 
reader-writer.cc, which illustrates the snapshotting write; the 
`read_whole_file` function has been modified to read one of the snapshots (I 
will roll back that change and provide a separate example before the merge).



