Bumping this up. Would love to get some feedback from the community.

Best,
Chao

On Sun, Feb 16, 2020 at 7:20 PM Xinli shang <[email protected]> wrote:

> Hi all,
>
> I am developing tools to prune or mask some Parquet file columns for
> cost-saving or security & compliance purposes. I want to collect your
> thoughts on usefulness or concerns. Please reply to this email or comment
> on the tickets (PARQUET-1791
> <https://issues.apache.org/jira/browse/PARQUET-1791>, PARQUET-1792
> <https://issues.apache.org/jira/browse/PARQUET-1792>).
>
> These tools will be delivered in the Parquet-tools/Parquet-cli projects
> and will be used as offline operations on existing files. In the prune
> case, a new Parquet file is generated by cloning all the columns except
> the ones to be pruned. There is no decoding/encoding or
> decompression/compression of the surviving columns, so the throughput is
> much higher than doing it in query engines. For the masking case, we do
> the same for the columns that don't need to be masked; the columns to be
> masked are replaced with hash values, nulls, or user-defined
> translations.
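
As a rough illustration of the per-value masking translations mentioned above (hash, null, or a user-defined function), here is a hypothetical Python sketch. The actual tool would operate on Parquet column chunks in Java; the function and parameter names here are assumptions, not the tool's API:

```python
import hashlib

def mask_value(value, mode="hash", translate=None):
    """Apply one of the masking translations: hash, null, or custom."""
    if mode == "null":
        # drop the raw value entirely
        return None
    if mode == "hash":
        # replace the raw value with its SHA-256 hex digest; the digest
        # is stable, so joins and group-bys on the column still work
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    if mode == "custom" and translate is not None:
        # user-defined translation
        return translate(value)
    raise ValueError(f"unknown masking mode: {mode}")
```

For example, `mask_value("123-45-6789")` always yields the same digest for the same input, which is why a hashed column can remain useful for analytics even though the raw data is gone.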
>
> Why do we need these tools?
>
>    1. Analytics tables usually have many columns, but some columns are
>    never used or stop being used after some time. Removing those columns
>    can save storage costs. For some large-scale tables, it could save
>    several million dollars.
>    2. In some cases (though not all), a masked value for a sensitive
>    column, for example the hash of the raw data, can be sufficient for
>    analytical purposes, so we can replace the raw data with a hash,
>    null, etc.
>    3. The alternative is to do this in a query engine, for example by
>    selecting only the needed columns and saving them into a new table.
>    But doing it directly on the Parquet file avoids decoding/encoding
>    and decompression/compression, achieving much higher (10x+ in my
>    testing) throughput. For large-scale tables, high throughput is the
>    key to success.
>
>
> Please note that this effort is not meant to replace the data masking
> (obfuscation) effort (PARQUET-1376), which is independent of this and
> should move forward on its own.
>
> Thanks for spending time reading! Any comments are welcome!
>
> --
> Xinli Shang
>