[ https://issues.apache.org/jira/browse/PARQUET-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Lee resolved PARQUET-1289.
--------------------------------
    Resolution: Won't Fix

Reworking this into a new spec.

> Spec for Updateable Parquet
> ---------------------------
>
>                 Key: PARQUET-1289
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1289
>             Project: Parquet
>          Issue Type: Wish
>          Components: parquet-format
>            Reporter: David Lee
>            Priority: Minor
>
> Parquet today is a read-only columnar format, but can we also make it 
> updateable using the methods in Apache Arrow for row filtering?
> Here's how it would work:
> A. Add an insert timestamp for every single record in a parquet file.
> B. Add a modifiable list of row offsets to the parquet file's footer for 
> records in the parquet file which have been logically deleted. Each offset 
> should also carry a delete timestamp so that a snapshot of what the data 
> looked like at any point in time can be reproduced.
> C. If a parquet record is ever updated, the new version would be written as a 
> new record in a different parquet file, and the old record would be logically 
> deleted by adding its row offset to its parquet file's footer. We would need 
> a service that does this.
> D. When reading parquet files, logically deleted rows would be excluded.
> E. Alternatively, when reading parquet files with a snapshot time, any rows 
> with an insert timestamp > snapshot time would be excluded, and rows which 
> have been logically flagged for deletion would still be included if their 
> delete timestamp > snapshot time (see the sketch after this description).
> This way we do not have to reorganize the columnar data in existing parquet 
> files. We just have to modify the metadata footer.
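
A minimal sketch of steps C, D and E in plain Python, assuming hypothetical
per-row insert timestamps and footer tombstones of (row offset, delete
timestamp); the names Row, FileFooter, visible_rows and update_record are
illustrative only, not an existing Parquet or Arrow API:

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Tuple


    @dataclass
    class Row:
        offset: int        # row position within its parquet file
        insert_ts: int     # hypothetical per-record insert timestamp (step A)
        data: Dict[str, object]


    @dataclass
    class FileFooter:
        # hypothetical footer list of logically deleted rows (step B):
        # one (row offset, delete timestamp) tombstone per deleted record
        deleted: List[Tuple[int, int]] = field(default_factory=list)


    def update_record(old_footer: FileFooter, old_offset: int,
                      new_file_rows: List[Row], new_row: Row,
                      now_ts: int) -> None:
        """Step C: write the new version into a different file and logically
        delete the old row by appending a tombstone to the old footer."""
        old_footer.deleted.append((old_offset, now_ts))
        new_file_rows.append(new_row)


    def visible_rows(rows: List[Row], footer: FileFooter,
                     snapshot_ts: Optional[int] = None) -> List[Row]:
        """Apply the read rules sketched in steps D and E."""
        delete_ts_by_offset = dict(footer.deleted)
        out = []
        for row in rows:
            delete_ts = delete_ts_by_offset.get(row.offset)
            if snapshot_ts is None:
                # Step D: a plain read excludes logically deleted rows.
                if delete_ts is None:
                    out.append(row)
            else:
                # Step E: a snapshot read excludes rows inserted after the
                # snapshot, and keeps deleted rows only if the delete happened
                # after the snapshot (they were still live at that time).
                if row.insert_ts > snapshot_ts:
                    continue
                if delete_ts is not None and delete_ts <= snapshot_ts:
                    continue
                out.append(row)
        return out


    # Example: row 1 was deleted at t=15, so it is hidden from a current read
    # but still visible in a snapshot taken at t=12.
    rows = [Row(0, 10, {"id": "a"}), Row(1, 10, {"id": "b"})]
    footer = FileFooter(deleted=[(1, 15)])
    assert [r.offset for r in visible_rows(rows, footer)] == [0]
    assert [r.offset for r in visible_rows(rows, footer, snapshot_ts=12)] == [0, 1]

The point of the sketch is that both the plain read and the snapshot read are
resolved entirely from the footer tombstones, so the existing column data never
has to be rewritten.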



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
