David Lee created PARQUET-1289:
----------------------------------
Summary: Spec for Updateable Parquet
Key: PARQUET-1289
URL: https://issues.apache.org/jira/browse/PARQUET-1289
Project: Parquet
Issue Type: Wish
Components: parquet-format
Reporter: David Lee
Parquet today is a read-only columnar format, but can we also make it
updateable using the methods in Apache Arrow for row filtering?
Here's how it would work:
A. Add an insert timestamp for every single record in a parquet file.
B. Add a modifiable list of row offsets to the parquet file's footer for
records in the parquet file which have been logically deleted. We should also
include the delete timestamp for every offset in order to reproduce a
snapshot of what the data looked like at any point in time.
C. If a parquet record is ever updated, the updated record would be a new record
and the old record in the parquet file would be logically deleted by adding its
row offset to its parquet file's footer. We would need a service that does this.
D. When reading parquet files, logically deleted rows would be excluded.
E. Alternatively, when reading parquet files with a snapshot time, any rows in
the parquet files with an insert timestamp > snapshot time would be excluded,
and any rows which have been logically flagged for deletion would still be
included if their delete timestamp > snapshot time.
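The read rules in D and E can be sketched in a few lines of Python. This is only an illustration, not an existing Parquet or Arrow API: `ParquetFileSketch`, `read_rows`, and the integer timestamps are hypothetical names and simplifications. It assumes a deleted row stays visible only to snapshots taken before its delete timestamp.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParquetFileSketch:
    """Hypothetical in-memory stand-in for one parquet file."""
    rows: List[dict]        # row data; the list index plays the role of the row offset
    insert_ts: List[int]    # step A: one insert timestamp per row
    # step B: footer entry mapping row offset -> delete timestamp
    deleted: Dict[int, int] = field(default_factory=dict)


def read_rows(f: ParquetFileSketch, snapshot_time: Optional[int] = None) -> List[dict]:
    """Return the rows visible now (step D) or as of snapshot_time (step E)."""
    visible = []
    for offset, row in enumerate(f.rows):
        delete_ts = f.deleted.get(offset)
        if snapshot_time is None:
            # Step D: a plain read skips every logically deleted row.
            keep = delete_ts is None
        else:
            # Step E: the row must already exist at the snapshot time
            # and must not yet have been deleted at that time (assumption).
            inserted = f.insert_ts[offset] <= snapshot_time
            live = delete_ts is None or delete_ts > snapshot_time
            keep = inserted and live
        if keep:
            visible.append(row)
    return visible
```

In a real reader the same predicate would more likely be applied as a row filter on whole column batches rather than row by row.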
This way we do not have to reorganize the columnar data in existing parquet
files. We just have to modify the metadata footer.
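As a rough sketch of the update path in step C, the function below only touches a footer-like delete log for the old file and appends the new version of the record elsewhere. Everything here is hypothetical: `logical_update`, the dict used as the delete log, and the list standing in for the file that receives new records are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical footer structure from step B: row offset -> delete timestamp.
DeleteLog = Dict[int, int]


def logical_update(
    old_delete_log: DeleteLog,
    offset: int,
    new_row: dict,
    now: int,
    new_file: List[Tuple[int, dict]],
) -> int:
    """Logically delete the row at `offset` and write its replacement as a new record.

    The old file's column data is never rewritten; only its delete log changes.
    Returns the row offset of the new record in `new_file`.
    """
    if offset in old_delete_log:
        raise ValueError(f"row offset {offset} is already logically deleted")
    # Record the logical delete in the old file's footer.
    old_delete_log[offset] = now
    # Store the updated record as a new row carrying its insert timestamp (step A).
    new_file.append((now, new_row))
    return len(new_file) - 1
```

Using the same timestamp for the delete and the insert means a snapshot read sees either the old or the new version of the record, depending on how the boundary comparison is defined.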