Re: [DISCUSS] Parquet data masking/anonymization

Gidon Gershinsky Mon, 10 Aug 2020 04:15:54 -0700

Hi Micah,

Yep, we've been asking ourselves the same question; this is one of the
reasons we take this slowly.
The general answer is that we want to spare users the need to implement
the masking mechanism (and the privacy leakage analysis tools) on their own.
The idea is to create a common set of open source tools that implement the
best practices in this field and benefit from the community's contributions in
terms of use-case requirements, design improvements and bug fixes.
Also, if we manage to find a way to compress N masked versions of the same
column, using an algorithm that produces (way) less than N times the bytes
of a single version, then we might want to integrate the obfuscation feature
deeper in the Parquet stack. But this is an advanced goal, TBD.
We'll proceed top-down, starting with an above-the-surface tool that can
convert a regular file into a file with additional columns (masked versions
of the sensitive columns). Then we'll explore doing the same just under the
surface, where a new file is written directly with the masked columns added
automatically. We'll see then if we can/should go deeper.
The doc authors have motivating use-cases in their respective
organizations. We do ask for additional use cases / requirements, and
general feedback.
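
To make the "above-the-surface" step concrete, here is a minimal sketch of
what such a conversion could look like. It is an illustration only, not the
planned tool: it uses plain Python with a dict of column lists standing in
for a Parquet file, and the function names, the "_hashed" / "_redacted"
column suffixes and the demo key are all hypothetical. It uses a keyed hash
(HMAC-SHA256) on the assumption that an unkeyed hash of low-entropy values
could be reversed by guessing.

```python
import hashlib
import hmac


def hash_mask(value: str, key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256) of the value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_mask(value: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` characters with '*'."""
    visible = value[-keep_last:] if keep_last > 0 else ""
    return "*" * (len(value) - len(visible)) + visible


def add_masked_columns(table: dict, sensitive: list, key: bytes) -> dict:
    """Return a copy of `table` with masked versions of each sensitive column.

    `table` maps column names to lists of string values (a stand-in for a
    columnar file). The original columns are kept unchanged.
    """
    out = dict(table)
    for name in sensitive:
        # Hypothetical naming scheme for the added columns.
        out[name + "_hashed"] = [hash_mask(v, key) for v in table[name]]
        out[name + "_redacted"] = [redact_mask(v) for v in table[name]]
    return out


# Example: one sensitive column, two rows. The key is a placeholder.
table = {"id": [1, 2], "ssn": ["123-45-6789", "987-65-4321"]}
masked = add_masked_columns(table, ["ssn"], key=b"demo-key-not-for-production")
```

In this sketch the original column stays in the output, so in practice it
would still need to be protected (e.g. by column encryption) while the
masked columns are made more widely readable.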


Cheers, Gidon


On Sat, Aug 8, 2020 at 7:10 AM Micah Kornfield <[email protected]>
wrote:

> Hi Gidon,
> Was there prior discussion on this on the mailing list?  I left a comment
> on the document, but it isn't clear to me why this particular use-case
> needs to be part of the core parquet library.
>
> Are there motivating use-cases that wouldn't be served by an external
> library/application level?
>
> Thanks,
> Micah
>
> On Mon, Aug 3, 2020 at 11:20 PM Gidon Gershinsky <[email protected]> wrote:
>
> > Hi all,
> >
> > Now that the encryption mechanism is mostly complete, we are starting a
> > long-term project on a new security feature on top of encryption. Called
> > "data obfuscation", it combines masking and anonymization of sensitive
> > data.
> > https://issues.apache.org/jira/browse/PARQUET-1376
> >
> > On the one hand, a basic masking can be easily implemented on top of
> > Parquet, by simply adding columns with masked (hashed, redacted, etc)
> > versions of the original column data. On the other hand, if done
> > improperly, data masking can leak out the sensitive information. For these
> > two reasons, we have decided not to rush it; this feature is not planned
> > for the upcoming Parquet versions. Following an initial discussion, we have
> > produced a write-up on the goals, challenges and possible approaches.
> > Before drafting the design, we start with a call to the community to
> > provide feedback on this write-up (e.g. via comments inside the doc). Any
> > real-life examples, use cases, requirements are very welcome.
> >
> >
> >
> > https://docs.google.com/document/d/1LMs74uhqvMNJacBySPnWq6tM8qIpgcIZz444c7vfibM/edit?usp=sharing
> >
> >
> > Cheers,
> > Gidon, Xinli, Shri
> >
>
