Hi Micah,

Yep, we've been asking ourselves the same question; this is one of the reasons we are taking this slowly. The general answer is that we want to help users avoid having to implement the masking mechanism (and the privacy leakage analysis tools) on their own. The idea is to create a common set of open source tools that implement the best practices in this field and benefit from the community's contributions in terms of use-case requirements, design improvements and bug fixes.

Also, if we manage to find a way to compress N masked versions of the same column using an algorithm that produces (way) less than N times the bytes of a single version, then we might want to integrate the obfuscation feature deeper in the Parquet stack. But this is an advanced goal, TBD.

We'll proceed top-down, starting with an above-the-surface tool that can convert a regular file into a file with additional columns (masked versions of the sensitive columns). Then we'll explore doing the same just under the surface, where a new file is written directly with masked columns added automatically. We'll then see whether we can/should go deeper.

The doc authors have motivating use cases in their respective organizations. We do ask for additional use cases / requirements, and general feedback.
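To make the above-the-surface step a bit more concrete, here is a minimal sketch of adding masked versions of a sensitive column next to the original. It is an illustration only, not the proposed design: it uses a plain dict of column lists as a stand-in for a Parquet table, and the function names, the `_hashed` / `_redacted` suffixes, and the choice of HMAC-SHA256 and keep-last-4 redaction are all assumptions made for the example.

```python
import hashlib
import hmac


def mask_hash(value: str, key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256) of the value as a hex string.

    A key is used because an unkeyed hash of low-entropy data (phone
    numbers, card numbers) can be reversed by a dictionary attack.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_redact(value: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` characters with '*'.

    Values no longer than `keep_last` are redacted completely, so that
    short values are not exposed in full.
    """
    if keep_last <= 0 or len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]


def add_masked_columns(table: dict, sensitive: list, key: bytes) -> dict:
    """Return a copy of `table` with masked columns added.

    `table` maps column names to lists of string values (a hypothetical
    stand-in for a real columnar file). Original columns are kept as-is.
    """
    out = dict(table)
    for name in sensitive:
        out[name + "_hashed"] = [mask_hash(v, key) for v in table[name]]
        out[name + "_redacted"] = [mask_redact(v) for v in table[name]]
    return out


if __name__ == "__main__":
    table = {
        "name": ["alice", "bob"],
        "card": ["1234567812345678", "123"],
    }
    masked = add_masked_columns(table, ["card"], key=b"demo-key")
    print(masked["card_redacted"])
```

Even this toy version shows where things can go wrong (key handling, short values, low-entropy inputs), which is the kind of leakage the common tooling is meant to help users avoid.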
Cheers,
Gidon

On Sat, Aug 8, 2020 at 7:10 AM Micah Kornfield <[email protected]> wrote:

> Hi Gidon,
> Was there prior discussion on this on the mailing list? I left a comment
> on the document, but it isn't clear to me why this particular use-case
> needs to be part of the core parquet library,
>
> Are there motivating use-cases that wouldn't be served by an external
> library/application level?
>
> Thanks,
> Micah
>
> On Mon, Aug 3, 2020 at 11:20 PM Gidon Gershinsky <[email protected]> wrote:
>
> > Hi all,
> >
> > Now that the encryption mechanism is mostly complete, we are starting a
> > long-term project on a new security feature on top of encryption. Called
> > "data obfuscation", it combines masking and anonymization of sensitive
> > data.
> > https://issues.apache.org/jira/browse/PARQUET-1376
> >
> > On the one hand, a basic masking can be easily implemented on top of
> > Parquet, by simply adding columns with masked (hashed, redacted, etc)
> > versions of the original column data. On the other hand, if done
> > improperly, data masking can leak out the sensitive information. For these
> > two reasons, we have decided not to rush it, this feature is not planned
> > for the upcoming Parquet versions. Following an initial discussion, we have
> > produced a write up on the goals, challenges and possible approaches.
> > Before drafting the design, we start with a call to the community to
> > provide feedback on this write up (eg via comments inside the doc). Any
> > real-life examples, usecases, requirements are very welcome.
> >
> > https://docs.google.com/document/d/1LMs74uhqvMNJacBySPnWq6tM8qIpgcIZz444c7vfibM/edit?usp=sharing
> >
> > Cheers,
> > Gidon, Xinli, Shri
