[ 
https://issues.apache.org/jira/browse/PARQUET-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997093#comment-16997093
 ] 

Machiel Groeneveld commented on PARQUET-1155:
---------------------------------------------

Hi [~zjumad], there is no news from the Parquet side. A recent development in 
the community that addresses this problem is Delta Lake: it adds a layer on 
top of Parquet that allows deletions, although the Parquet files themselves 
remain read-only. 
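
To illustrate the general idea of such a layer, here is a minimal Python 
sketch. It is a toy model and not Delta Lake's actual implementation or API: 
JSON files stand in for immutable Parquet files, and every name in it 
(write_data_file, append_log, live_files, read_table, delete_customer, vacuum, 
the _log.jsonl file, the customer_id field) is hypothetical. A delete rewrites 
only the data files that contain the customer and records the swap in an 
append-only log.

```python
import json
import os
import uuid

LOG_NAME = "_log.jsonl"  # hypothetical append-only commit log


def write_data_file(table_dir, rows):
    """Write rows to a new data file (never modified afterwards); return its name."""
    name = f"part-{uuid.uuid4().hex}.json"  # JSON stands in for Parquet here
    with open(os.path.join(table_dir, name), "w") as f:
        json.dump(rows, f)
    return name


def append_log(table_dir, actions):
    """Append one commit, a list of add/remove actions, to the log."""
    with open(os.path.join(table_dir, LOG_NAME), "a") as f:
        f.write(json.dumps(actions) + "\n")


def live_files(table_dir):
    """Replay the log to find the data files that currently make up the table."""
    live = set()
    with open(os.path.join(table_dir, LOG_NAME)) as f:
        for line in f:
            for action in json.loads(line):
                if action["op"] == "add":
                    live.add(action["file"])
                else:
                    live.discard(action["file"])
    return sorted(live)


def read_table(table_dir):
    """Read all rows from the live data files only."""
    rows = []
    for name in live_files(table_dir):
        with open(os.path.join(table_dir, name)) as f:
            rows.extend(json.load(f))
    return rows


def delete_customer(table_dir, customer_id):
    """Logically delete a customer by rewriting only the files that contain it."""
    actions = []
    for name in live_files(table_dir):
        with open(os.path.join(table_dir, name)) as f:
            rows = json.load(f)
        kept = [r for r in rows if r["customer_id"] != customer_id]
        if len(kept) == len(rows):
            continue  # customer not in this file, so it is left untouched
        actions.append({"op": "remove", "file": name})
        if kept:
            new_name = write_data_file(table_dir, kept)
            actions.append({"op": "add", "file": new_name})
    if actions:
        append_log(table_dir, actions)
    return [a["file"] for a in actions if a["op"] == "remove"]


def vacuum(table_dir):
    """Physically remove data files that the log no longer references."""
    live = set(live_files(table_dir))
    removed = []
    for name in sorted(os.listdir(table_dir)):
        if name.startswith("part-") and name not in live:
            os.remove(os.path.join(table_dir, name))
            removed.append(name)
    return removed
```

Note that in this sketch the old file, which still contains the customer's 
rows, stays on disk after delete_customer and only disappears after vacuum. 
For an erase request, that physical cleanup step is the part that matters.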

> Support for GDPR erase requirements
> -----------------------------------
>
>                 Key: PARQUET-1155
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1155
>             Project: Parquet
>          Issue Type: Wish
>          Components: parquet-format
>    Affects Versions: 1.8.2
>            Reporter: Machiel Groeneveld
>            Priority: Major
>
> As I understand it, Parquet is write-once, so mutating data inside 
> Parquet files is not an option. A new EU-wide law comes into effect in 
> May 2018 that requires companies to delete data pertaining to a customer 
> when asked to do so.
> Our case is quite simple: our biggest Parquet tables collect 7.5 billion rows 
> a month, so removing data by duplicating such a table while filtering out the 
> unwanted customer data is not feasible. 
> Perhaps there is some way to remove particular data? Or perhaps there is an 
> efficient way to do read/filter/write? Zeroing out the data might be an 
> option, since it would not change the layout of the files. 
> I am not sure if this is the right platform to start this discussion, but I 
> think more people will have this issue once it becomes clear that data needs 
> to be deleted everywhere, including in Parquet files. Companies face 
> multimillion-dollar fines if they don't comply with GDPR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
