Hi Chao,

Some concerns came up about the 'mask' feature prior to this thread, and they were already addressed by Xinli. See PARQUET-1792 for details. I don't have anything else to add here and am happy to review the code changes related to these features.
Cheers,
Gabor

On Thu, Feb 20, 2020 at 3:58 AM Chao Sun <[email protected]> wrote:

> Bumping it up. Would love to get some feedback from the community.
>
> Best,
> Chao
>
> On Sun, Feb 16, 2020 at 7:20 PM Xinli shang <[email protected]> wrote:
>
> > Hi all,
> >
> > I am developing tools to prune or mask some Parquet file columns for
> > cost-saving or security & compliance purposes. I want to collect your
> > thoughts on usefulness or concerns. Please reply to this email or comment
> > on the tickets (PARQUET-1791
> > <https://issues.apache.org/jira/browse/PARQUET-1791>, PARQUET-1792
> > <https://issues.apache.org/jira/browse/PARQUET-1792>).
> >
> > These tools are going to be delivered in the parquet-tools/parquet-cli
> > project and will be used as offline operations on existing files. In the
> > prune case, a new Parquet file will be generated by cloning all the
> > columns except the ones to be pruned. There won't be any decoding/encoding
> > or decompression/compression on the surviving columns, so the throughput
> > is much higher than doing it in query engines. For the mask case, we are
> > going to do the same for the columns that don't need to be masked; the
> > columns to be masked will be replaced with hash values, nulls, or
> > user-defined translations.
> >
> > Why do we need these tools?
> >
> > 1. Analytics tables usually have a lot of columns, but some columns are
> >    never used or stop being used after some time. Removing those columns
> >    can save storage costs; for some large-scale tables, it could save
> >    several million dollars.
> >
> > 2. In some cases (not all, though), the masked value of a sensitive
> >    column, for example the hash of the raw data, could be sufficient for
> >    analytical purposes. So we can replace the raw data with a hash, null,
> >    etc.
> >
> > 3. The alternative would be to do this in query engines, for example by
> >    selecting only the needed columns and saving them into a new table.
> >    But doing it on the Parquet files directly avoids decoding/encoding
> >    and decompression/compression, achieving much higher throughput
> >    (10x+ in my testing). For large-scale tables, high throughput is the
> >    key to success.
> >
> > Please note that this effort is not meant to replace the data masking
> > (obfuscation) effort (PARQUET-1376), which is independent of this and
> > should move forward.
> >
> > Thanks for spending time reading! Any comments are welcome!
> >
> > --
> > Xinli Shang
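For readers skimming the thread, the masking translations the quoted proposal describes (hash, null, or a user-defined function) can be sketched on a toy in-memory columnar table. This is only an illustration of the idea, not the actual parquet-tools/parquet-cli implementation (which operates on Parquet column chunks without re-encoding the untouched columns); all function and parameter names below are hypothetical:

```python
import hashlib

def mask_column(table, column, mode="hash", translate=None):
    """Return a copy of `table` (dict: column name -> list of values)
    with one column masked.

    mode="hash":   replace each value with its SHA-256 hex digest
    mode="null":   replace each value with None
    mode="custom": apply the user-defined `translate` function
    """
    masked = dict(table)  # shallow copy; untouched columns are shared as-is
    if mode == "hash":
        masked[column] = [
            hashlib.sha256(str(v).encode("utf-8")).hexdigest()
            for v in table[column]
        ]
    elif mode == "null":
        masked[column] = [None] * len(table[column])
    elif mode == "custom":
        masked[column] = [translate(v) for v in table[column]]
    else:
        raise ValueError(f"unknown mask mode: {mode}")
    return masked

def prune_columns(table, columns_to_drop):
    """Return a copy of `table` without the pruned columns."""
    return {name: col for name, col in table.items()
            if name not in columns_to_drop}

table = {"user_id": [1, 2], "ssn": ["111-22-3333", "444-55-6666"]}
masked = mask_column(table, "ssn", mode="hash")   # ssn replaced by digests
pruned = prune_columns(table, {"ssn"})            # ssn dropped entirely
```

Note how both operations leave the surviving columns untouched; the real tools exploit the same property at the file level by copying the surviving column chunks byte-for-byte, which is where the throughput advantage over a query-engine rewrite comes from.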
