[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034553#comment-17034553
 ] 

Xinli Shang edited comment on PARQUET-1792 at 2/11/20 3:47 PM:
---------------------------------------------------------------

[~gszadovszky] the tool can be run in parallel in a cluster. For example, we 
can easily write a Spark application to do it. In fact, even for 'prune' we 
still need to write a Spark application to parallelize it. Otherwise, the time 
to finish is still significant, although it is already faster than doing it in 
query engines.  
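
To illustrate the per-file parallelism described above, here is a minimal 
sketch in Python. It is not the parquet-tools implementation: mask_file and 
mask_all are hypothetical names, the file names are made up, and a local 
thread pool stands in for the cluster. The only point it makes is that each 
file can be processed independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def mask_file(path: str) -> str:
    """Hypothetical stand-in for running the masking tool on one Parquet file."""
    # A real job would rewrite the file here; this sketch only derives
    # the name of the output file.
    return path.replace(".parquet", ".masked.parquet")


def mask_all(paths, workers=4):
    # Files do not depend on each other, so the work can be spread
    # across workers. pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mask_file, paths))


if __name__ == "__main__":
    files = [f"part-{i:05d}.parquet" for i in range(8)]
    for out in mask_all(files):
        print(out)
```

In a Spark application the same shape would presumably be a distributed 
collection of file paths with the per-file function applied on the executors, 
but the exact job layout is an assumption and not something stated in this 
thread.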

Regarding reading the original values and generating the hash/statistics, we 
only need to do that for the columns to be masked. In many cases, we see that 
only a few columns need to be masked. All other columns are moved as a whole, 
as in the 'merge' or 'prune' command, which is a big saving. So yes, this 
operation would be slower than the 'prune' command, but it can still save a 
lot compared with doing it via a query engine.  
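
The cost argument above can be sketched with a small in-memory model. This is 
only an illustration under stated assumptions: a dict of Python lists stands 
in for the column chunks of a file, and mask_value, mask_columns and the mode 
names "hash", "null" and "redact" are hypothetical, chosen to mirror the 
masked values mentioned in the issue. Only the masked columns are read value 
by value; every other column is handed over untouched.

```python
import hashlib


def mask_value(value, mode):
    """Return the masked form of one value (hypothetical modes)."""
    if mode == "hash":
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    if mode == "null":
        return None
    if mode == "redact":
        return "REDACTED"
    raise ValueError(f"unknown mask mode: {mode}")


def mask_columns(table, masks):
    """table: column name -> list of values; masks: column name -> mode."""
    out = {}
    for name, values in table.items():
        if name in masks:
            # Masked column: every value has to be read and rewritten.
            out[name] = [mask_value(v, masks[name]) for v in values]
        else:
            # Unmasked column: passed through as a whole, no per-value work.
            out[name] = values
    return out


table = {
    "user_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "city": ["SF", "NYC"],
}
masked = mask_columns(table, {"email": "hash", "user_id": "null"})
```

Here the "city" column in the result is the very same list object as in the 
input, which models moving a column as a whole, while "email" and "user_id" 
pay the per-value cost.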



> Add 'mask' command to parquet-tools/parquet-cli
> -----------------------------------------------
>
>                 Key: PARQUET-1792
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1792
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being 
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with 
> masked values. The masked value could be a hash, null, a redacted value, 
> etc. Unchanged columns should be moved as a whole, as the 'merge' and 
> 'prune' commands in parquet-tools do. 
>  
> Implementing this feature at the file format level is 10X faster than doing 
> it by rewriting the table data in the query engine. 



