David Lee created PARQUET-1289:
----------------------------------

             Summary: Spec for Updateable Parquet
                 Key: PARQUET-1289
                 URL: https://issues.apache.org/jira/browse/PARQUET-1289
             Project: Parquet
          Issue Type: Wish
          Components: parquet-format
            Reporter: David Lee


Parquet today is a read-only columnar format, but can we also make it 
updateable using the methods in Apache Arrow for row filtering?

Here's how it would work:

A. Add an insert timestamp for every single record in a parquet file.
B. Add a modifiable list of row offsets to the parquet file's footer, 
identifying records in the file which have been logically deleted. Each offset 
should also carry its delete timestamp so that we can reproduce a snapshot of 
what the data looked like at any point in time.
C. If a parquet record is ever updated, the updated record would be written as 
a new record and the old record would be logically deleted by adding its row 
offset to its parquet file's footer. We would need a service that does this.
D. When reading parquet files, logically deleted rows would be excluded.
E. Alternatively, when reading parquet files with a snapshot time, any rows 
with an insert timestamp > snapshot time would be excluded, and any rows which 
have been logically flagged for deletion would still be included if their 
delete timestamp > snapshot time (they had not yet been deleted at that point).
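
The read rules in D and E can be sketched in a few lines of Python. This is 
only an illustration under stated assumptions, not a proposed implementation: 
FileSketch, logical_update and read_rows are hypothetical names, each parquet 
file is modelled as plain in-memory lists, the footer's delete list is a dict 
of row offset -> delete timestamp, timestamps are plain integers, and a deleted 
row is assumed to stay visible to snapshots taken before its delete timestamp.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileSketch:
    """Hypothetical in-memory stand-in for one parquet file."""
    rows: List[dict]       # column data, never rewritten after creation
    insert_ts: List[int]   # step A: one insert timestamp per row
    # Step B: footer metadata, row offset -> delete timestamp.
    deleted: Dict[int, int] = field(default_factory=dict)


def logical_update(files: List[FileSketch], file_idx: int, offset: int,
                   new_row: dict, now: int) -> None:
    """Step C: tombstone the old row in its footer, write the new row elsewhere."""
    files[file_idx].deleted[offset] = now
    # Assumption: the new version lands in a new file, so existing
    # column data is left untouched.
    files.append(FileSketch(rows=[new_row], insert_ts=[now]))


def read_rows(files: List[FileSketch],
              snapshot_ts: Optional[int] = None) -> List[dict]:
    """Steps D and E: current view if snapshot_ts is None, else a snapshot."""
    out = []
    for f in files:
        for offset, row in enumerate(f.rows):
            deleted_at = f.deleted.get(offset)
            if snapshot_ts is None:
                # Step D: skip every logically deleted row.
                if deleted_at is None:
                    out.append(row)
                continue
            # Step E: skip rows inserted after the snapshot time.
            if f.insert_ts[offset] > snapshot_ts:
                continue
            # Skip rows already deleted at the snapshot time.
            if deleted_at is not None and deleted_at <= snapshot_ts:
                continue
            out.append(row)
    return out
```

Treating a delete timestamp equal to the snapshot time as "already deleted" 
means an update never shows both the old and the new version of a record in 
the same snapshot.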

This way we do not have to reorganize the columnar data in existing parquet 
files. We just have to modify the metadata footer.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
