Hi all, I am developing tools to prune or mask certain Parquet file columns for cost-saving or security & compliance purposes. I would like to collect your thoughts on their usefulness and any concerns. Please reply to this email or comment on the tickets (PARQUET-1791 <https://issues.apache.org/jira/browse/PARQUET-1791>, PARQUET-1792 <https://issues.apache.org/jira/browse/PARQUET-1792>).
These tools will be delivered in the parquet-tools/parquet-cli projects and used as offline operations on existing files. In the prune case, a new Parquet file is generated by cloning all the columns except the ones to be pruned. There is no decoding/encoding or decompression/compression of the surviving columns, so the throughput is much higher than doing the same work in a query engine. In the masking case, we do the same for the columns that don't need to be masked; the columns to be masked are replaced with hash values, nulls, or user-defined translations.

Why do we need these tools?

1. Analytics tables usually have many columns, but some columns are never used or stop being used after some time. Removing those columns saves storage cost; for some large-scale tables, it could save several million dollars.

2. In some cases (though not all), a masked value for a sensitive column, for example the hash of the raw data, can be sufficient for analytical purposes, so we can replace the raw data with a hash, null, etc.

3. The alternative is to do this in a query engine, for example, select only the needed columns and save them into a new table. But operating on the Parquet files directly avoids decoding/encoding and decompression/compression and achieves much higher throughput (10x+ in my testing). For large-scale tables, high throughput is the key to success.

Please note that this effort is not meant to replace the data masking (obfuscation) effort (PARQUET-1376), which is independent of this and should move forward on its own.

Thanks for spending time reading! Any comments are welcome!

-- Xinli Shang
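P.S. To make the hash-masking idea concrete, here is a minimal Python sketch (not the actual tool code, which lives in parquet-tools/Java; the function name and optional salt are hypothetical). It shows why a hash can still be useful analytically: equal raw values map to equal digests, so joins and group-bys keep working while the raw data stays hidden, and nulls stay null.

```python
import hashlib

def mask_value(value, salt=b""):
    """Replace a sensitive value with a salted SHA-256 hex digest.

    Hashing preserves equality (same input -> same digest), so
    aggregations and joins on the masked column still work, while
    the raw value is no longer readable. Nulls are left as nulls.
    """
    if value is None:
        return None
    return hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()

# A column of raw values and its masked counterpart.
emails = ["alice@example.com", "bob@example.com", "alice@example.com", None]
masked = [mask_value(v) for v in emails]

# Repeated raw values still collide to the same digest, so
# "count distinct users" style queries give the same answer.
assert masked[0] == masked[2]
assert masked[3] is None
```

Note that an unsalted hash of low-entropy data (like email addresses) can be reversed by brute force, which is one reason the actual tool also offers nulls and user-defined translations as masking options.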
