I have posted a link to the design draft on the Jira.
All comments are welcome.
Thanks to Julien for the initial feedback and suggestions during our chat
on my SF trip last week.

The design document is relatively detailed; at 8 pages, it's actually
longer than the code used to implement it :).
Still, the code can be split into multiple pull requests, to enable a 
staged implementation of this mechanism.

Is there a Parquet call next week? I'd be glad to join and discuss this 
with the community.



Regards, 
Gidon







From:   Gidon Gershinsky <gg5...@gmail.com>
To:     dev@parquet.apache.org, rb...@netflix.com
Date:   20/12/2017 08:13 PM
Subject:        Re: Parquet modular encryption



Hi Ryan,

Certainly, I'd be glad to draft a design doc, sounds like a good idea.
Could you assign me to PARQUET-1178? I'll pin the doc link there.

I've seen a brief discussion on creating an 'encrypting compressor', but
indeed for data pages only.
My implementation encrypts pages (data and dictionary), headers and the
file footer. Also, I don't use a separate compressor for encryption; my
code works with any compression supported in Parquet.

Regards, Gidon.

On Wed, Dec 20, 2017 at 7:12 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Hi Gidon,
>
> Thanks for working on this. People have talked about using this approach
> for page data in the past, but I haven't seen an implementation of it.
> You encrypt headers as well to make sure column stats are not stored in
> plain text?
>
> I think it would be helpful if you wrote up a small doc on your changes,
> like the design doc for column indexes
> <https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#>.
> That way, we can discuss it in comments to make sure that you didn't
> miss any structures and validate the approach. Would you be willing to do
> that? Then we could figure out what we need to add to the Parquet spec to
> make this portable.
>
> Thanks!
>
> rb
>
> On Wed, Dec 20, 2017 at 7:46 AM, Gidon Gershinsky <gi...@il.ibm.com>
> wrote:
>
> > 'Hi' is missing in the message, due to a faulty copy/paste, sorry
> > about that :)
> >
> > And of course, all comments are most welcome.
> >
> >
> >
> > Regards,
> > Gidon
> >
> >
> >
> >
> >
> >
> >
> > From:   Gidon Gershinsky <gg5...@gmail.com>
> > To:     dev@parquet.apache.org
> > Date:   20/12/2017 05:33 PM
> > Subject:        Parquet modular encryption
> >
> >
> >
> > We are working on frameworks that perform secure analytics on encrypted
> > data. The analytic engine runs in a secured environment, but the data is
> > kept in untrusted storage. It could be public cloud storage or anything
> > else; the main requirement is to store and retrieve the data in
> > encrypted form only. The storage admin should never have the data key.
> > The data should be decrypted only at the end point (the analytic
> > engine), never in the storage.
> >
> > This obviously impacts the performance of Parquet selective reads. If a
> > Parquet file is bulk-encrypted in the storage, it becomes impossible to
> > extract its footer, retrieve a column subset, a few pages, etc. The file
> > must be fully delivered from storage to the engine location, decrypted
> > there, and then processed.
> > Moreover, even if the storage is trusted, it still has to fully decrypt
> > the file before parsing it and extracting select columns/pages.
> >
> > I've searched for available solutions to this problem, but haven't
> > found any (do let me know if I've missed anything!).
> >
> > So I have developed a basic Parquet implementation that performs
> > separate encryption of each header and page. It is fully functional, and
> > allows retrieving only the required data pieces while keeping the
> > Parquet file encrypted in the storage. Actually, it doesn't require deep
> > changes in Parquet code, since it builds on the existing Thrift and
> > compression mechanisms. It's also not intrusive, in the sense that if
> > encryption is not used, the new code is bypassed with a number of 'ifs',
> > so the existing apps and tests continue to run unaffected.
> >
> > It's still raw code, not quite ready yet for upstreaming. Unless you
> > guys tell me this is pointless :), I'll start on preparing it for a pull
> > request.
> >
> >
> >
> > Regards,
> > Gidon
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


