Machiel Groeneveld created PARQUET-1155:
-------------------------------------------

             Summary: Support for GDPR erase requirements
                 Key: PARQUET-1155
                 URL: https://issues.apache.org/jira/browse/PARQUET-1155
             Project: Parquet
          Issue Type: Wish
          Components: parquet-format
    Affects Versions: 1.8.2
            Reporter: Machiel Groeneveld


As I understand it, Parquet is a write-once format, so mutating data inside 
Parquet files is not an option. A new EU-wide law (GDPR) comes into effect in 
May 2018 that requires companies to delete data pertaining to a customer when 
asked to do so.

Our case is quite simple: our biggest Parquet tables collect 7.5 billion rows a 
month, so removing data by duplicating such a table while filtering out the 
unwanted customer data is not feasible. 

Perhaps there is some way to remove particular data? Or perhaps there is an 
efficient way to do read/filter/write? Zeroing the data in place might also be 
an option, since it would not change the layout of the files. 

I am not sure if this is the right platform to start this discussion, but I 
think more people will have this issue once it becomes clear that data needs to 
be deleted everywhere, including in Parquet files. Companies face 
multi-million-dollar fines if they don't comply with GDPR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
