[
https://issues.apache.org/jira/browse/PARQUET-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Lee resolved PARQUET-1289.
--------------------------------
Resolution: Won't Fix
Reworking a new spec
> Spec for Updateable Parquet
> ---------------------------
>
> Key: PARQUET-1289
> URL: https://issues.apache.org/jira/browse/PARQUET-1289
> Project: Parquet
> Issue Type: Wish
> Components: parquet-format
> Reporter: David Lee
> Priority: Minor
>
> Parquet today is a read-only columnar format, but could we also make it
> updateable using the methods in Apache Arrow for row filtering?
> Here's how it would work:
> A. Add an insert timestamp for every single record in a parquet file.
> B. Add a list of modifiable row offsets to the parquet file's footer for
> records in the file which have been logically deleted. Each offset should
> also carry a delete timestamp, so that a snapshot of what the data looked
> like at any point in time can be reproduced.
> C. If a parquet record is ever updated, the updated record would be written
> as a new record in a different parquet file, and the old record would be
> logically deleted by adding its row offset to its parquet file's footer.
> We would need a service that does this.
> D. When reading parquet files, logically deleted rows would be excluded.
> E. Alternatively, when reading parquet files with a snapshot time, any rows
> with an insert timestamp > snapshot time would be excluded, and any rows
> which have been logically flagged for deletion would still be included if
> their delete timestamp > snapshot time (i.e. the delete happened after the
> snapshot).
> This way we do not have to reorganize the columnar data in existing parquet
> files. We just have to modify the metadata footer.
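The visibility rules in steps A through E amount to a standard snapshot check: a row is visible if it was inserted at or before the snapshot time and not deleted at or before it. A minimal sketch in Python, using a plain in-memory model (the `FileSnapshot` class and `visible_rows` method are hypothetical illustrations, not Parquet or Arrow APIs):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileSnapshot:
    # Step A: one insert timestamp per row in the parquet file.
    insert_ts: List[int]
    # Step B: footer metadata mapping deleted row offsets to delete timestamps.
    deleted: Dict[int, int] = field(default_factory=dict)

    def visible_rows(self, snapshot_time: int) -> List[int]:
        """Row offsets visible at snapshot_time (steps D and E)."""
        rows = []
        for offset, ts in enumerate(self.insert_ts):
            if ts > snapshot_time:
                continue  # inserted after the snapshot: exclude (step E)
            delete_ts = self.deleted.get(offset)
            if delete_ts is not None and delete_ts <= snapshot_time:
                continue  # already deleted at snapshot time: exclude
            rows.append(offset)  # alive at snapshot time
        return rows

f = FileSnapshot(insert_ts=[10, 10, 20], deleted={1: 15})
print(f.visible_rows(12))  # [0, 1]: row 1 deleted at t=15, still alive at t=12
print(f.visible_rows(25))  # [0, 2]: row 1 now excluded, row 2 now inserted
```

Because only the footer's deleted-offset map changes over time, the column data itself never has to be rewritten, which is the point of the proposal.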
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)