[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034294#comment-17034294
 ] 

Gabor Szadovszky commented on PARQUET-1792:
-------------------------------------------

If you are talking about one file at a time, you might be right that it is 10x 
faster than doing it with a query engine. But the tool runs on a single node 
while the query engine uses several nodes at the same time, so I am not sure 
about the 10x performance gain.
Pruning the file makes sense to me at the library level because you can do it 
efficiently (there is no need to unpack/decode the pages or the entire column 
chunks). Masking the values, on the other hand, requires reading the actual 
values and generating the hashes. You also need to regenerate the related 
statistics.
Therefore, I am not sure that this masking feature is properly suited for 
parquet-mr.
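
To illustrate the difference in cost, here is a minimal sketch in plain Python. 
It does not use the parquet-mr API; the function names (mask_column, 
prune_columns), the choice of SHA-256, and the dict-of-bytes stand-in for 
column chunks are assumptions made only for this example. Masking has to touch 
every decoded value and rebuild min/max/null-count statistics, while pruning 
only drops still-encoded chunks by name.

```python
import hashlib


def mask_column(values):
    """Hash each non-null string value and recompute simple statistics.

    Hypothetical helper: stands in for decoding a column chunk, masking it,
    and regenerating the statistics that describe the masked data.
    """
    masked = [
        None if v is None else hashlib.sha256(v.encode("utf-8")).hexdigest()
        for v in values
    ]
    non_null = [m for m in masked if m is not None]
    stats = {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_count": len(masked) - len(non_null),
    }
    return masked, stats


def prune_columns(column_chunks, drop):
    """Drop whole column chunks by name without reading their bytes.

    Hypothetical helper: column_chunks maps a column name to its
    still-encoded bytes, which are passed through untouched.
    """
    return {name: chunk for name, chunk in column_chunks.items() if name not in drop}
```

In this sketch the old statistics cannot be reused after masking, because the 
minimum and maximum of the hashes are unrelated to those of the raw values.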

> Add 'mask' command to parquet-tools/parquet-cli
> -----------------------------------------------
>
>                 Key: PARQUET-1792
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1792
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being 
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with 
> masked values. The masked value could be a hash, null, a redacted value, etc. 
> Unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do. 
>  
> Implementing this feature at the file format level is 10X faster than doing 
> it by rewriting the table data in the query engine. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
