[
https://issues.apache.org/jira/browse/PARQUET-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997093#comment-16997093
]
Machiel Groeneveld commented on PARQUET-1155:
---------------------------------------------
Hi [~zjumad], there is no news from the Parquet side. A recent development in
the community that deals with this problem is Delta Lake, though. It adds a
layer on top of Parquet to allow for deletions, although the Parquet files
themselves are still read-only.
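To make the "layer on top" idea concrete, here is a toy sketch in Python. It is
not Delta Lake's API or implementation: the class ToyTable and all of its
methods are hypothetical, and in-memory tuples stand in for Parquet files. It
only shows the general pattern of immutable data files plus an append-only log
of add/remove actions, where a delete rewrites just the files that contain
matching rows.

```python
import itertools


class ToyTable:
    """Immutable data files plus an append-only log of add/remove actions."""

    def __init__(self):
        self.files = {}  # file name -> tuple of rows; never modified once written
        self.log = []    # append-only list of ("add" | "remove", file name)
        self._ids = itertools.count()

    def append(self, rows):
        """Write rows as a new immutable file and record it in the log."""
        name = f"part-{next(self._ids):05d}"
        self.files[name] = tuple(rows)
        self.log.append(("add", name))
        return name

    def active_files(self):
        """Replay the log to find the files that make up the current table."""
        active = []
        for action, name in self.log:
            if action == "add":
                active.append(name)
            else:
                active.remove(name)
        return active

    def read(self):
        return [row for name in self.active_files() for row in self.files[name]]

    def delete_where(self, predicate):
        """Logically delete rows by rewriting only the files that contain them."""
        for name in self.active_files():
            rows = self.files[name]
            kept = [row for row in rows if not predicate(row)]
            if len(kept) == len(rows):
                continue  # no matching rows: this file is left untouched
            self.log.append(("remove", name))
            if kept:
                self.append(kept)

    def vacuum(self):
        """Physically drop files that the log no longer references."""
        active = set(self.active_files())
        for name in list(self.files):
            if name not in active:
                del self.files[name]


table = ToyTable()
table.append([{"customer": "a", "amount": 1}, {"customer": "b", "amount": 2}])
table.append([{"customer": "c", "amount": 3}])
table.delete_where(lambda row: row["customer"] == "b")
table.vacuum()
```

In this sketch the second file is never rewritten, which is the point for very
large tables. Note also that after delete_where the old file still exists until
vacuum runs, so for an erase request the physical cleanup step matters as much
as the logical delete.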
> Support for GDPR erase requirements
> -----------------------------------
>
> Key: PARQUET-1155
> URL: https://issues.apache.org/jira/browse/PARQUET-1155
> Project: Parquet
> Issue Type: Wish
> Components: parquet-format
> Affects Versions: 1.8.2
> Reporter: Machiel Groeneveld
> Priority: Major
>
> As I understand it, Parquet is a write-once format, so mutating data inside
> Parquet files is not an option. Now there is a new EU-wide law coming into
> effect in May 2018 that requires companies to delete data pertaining to a
> customer if asked to do so.
> Our case is quite simple: our biggest Parquet tables collect 7.5 billion rows
> a month, so removing data by duplicating such a table whilst filtering out the
> unwanted customer data is not feasible.
> Perhaps there is some way to remove particular data? Or perhaps there is an
> efficient way to do read/filter/write? Perhaps zeroing out the data is an
> option, so that the layout of the files does not change.
> Not sure if this is the right platform to start this discussion, but I think
> more people will have this issue once it becomes clear that data needs to be
> deleted in all places, including Parquet files. Companies face
> multi-million-dollar fines if they don't comply with GDPR.