Hi Ryan,

Certainly, I'd be glad to draft a design doc; it sounds like a good idea.
Could you assign me to PARQUET-1178? I'll pin the doc link there.

I've seen a brief discussion on creating an 'encrypting compressor', but
indeed for data pages only.
My implementation encrypts pages (data and dictionary), headers and the
file footer. Also, I don't use a separate compressor for encryption; my
code works with any compression supported in Parquet.
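
As a rough illustration of the idea (a sketch under stated assumptions, not
the actual implementation), the Python below encrypts each module separately
after an optional, pluggable compression step, so that a single page can be
read back without touching the rest of the file. All names here
(`encrypt_module`, `write_file`, `read_module`, the module labels) are
hypothetical, and the XOR keystream is a toy stand-in for a real cipher; it
is not secure and only keeps the example dependency-free. The offset index
is returned separately for simplicity.

```python
import hashlib
import os
import zlib


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (NOT secure): XOR data with a SHA-256 counter keystream.

    Applying it twice with the same key and nonce restores the input.
    """
    out = bytearray()
    for start in range(0, len(data), 32):
        counter = (start // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[start:start + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def encrypt_module(key: bytes, plaintext: bytes, compress=None) -> bytes:
    """Compress first (with any codec, or none), then encrypt under a fresh nonce."""
    body = compress(plaintext) if compress else plaintext
    nonce = os.urandom(12)
    return nonce + keystream_xor(key, nonce, body)


def decrypt_module(key: bytes, blob: bytes, decompress=None) -> bytes:
    """Reverse of encrypt_module: decrypt, then decompress if a codec was used."""
    nonce, body = blob[:12], blob[12:]
    plain = keystream_xor(key, nonce, body)
    return decompress(plain) if decompress else plain


def write_file(key: bytes, modules: dict, compress=None):
    """Concatenate independently encrypted modules; return (bytes, offset index)."""
    buf = bytearray()
    index = {}
    for name, payload in modules.items():
        blob = encrypt_module(key, payload, compress)
        index[name] = (len(buf), len(blob))
        buf.extend(blob)
    return bytes(buf), index


def read_module(key: bytes, file_bytes: bytes, index: dict, name: str,
                decompress=None) -> bytes:
    """Fetch and decrypt one module by offset, leaving the others untouched."""
    offset, length = index[name]
    return decrypt_module(key, file_bytes[offset:offset + length], decompress)


# Usage: write two pages, then read back only one of them.
key = os.urandom(32)
modules = {"col_a/page_0": b"a" * 100, "col_b/page_0": b"b" * 100}
data, index = write_file(key, modules, zlib.compress)
page = read_module(key, data, index, "col_b/page_0", zlib.decompress)
```

Because the codec is passed in as a plain function, the encryption step does
not depend on which compression (if any) is used.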

Regards, Gidon.

On Wed, Dec 20, 2017 at 7:12 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Hi Gidon,
>
> Thanks for working on this. People have talked about using this approach
> for page data in the past, but I haven't seen an implementation of it. You
> encrypt headers as well to make sure column stats are not stored in plain
> text?
>
> I think it would be helpful if you wrote up a small doc on your changes,
> like the design doc for column indexes
> <https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#>.
> That way, we can discuss it in comments to make sure that you didn't miss
> any structures and validate the approach. Would you be willing to do that?
> Then we could figure out what we need to add to the Parquet spec to make
> this portable.
>
> Thanks!
>
> rb
>
> On Wed, Dec 20, 2017 at 7:46 AM, Gidon Gershinsky <gi...@il.ibm.com>
> wrote:
>
> > 'Hi' is missing in the message, due to a faulty copy/paste, sorry about
> > that :)
> >
> > And of course, all comments are most welcome.
> >
> >
> >
> > Regards,
> > Gidon
> >
> >
> >
> >
> >
> >
> >
> > From:   Gidon Gershinsky <gg5...@gmail.com>
> > To:     dev@parquet.apache.org
> > Date:   20/12/2017 05:33 PM
> > Subject:        Parquet modular encryption
> >
> >
> >
> > We are working on frameworks that perform secure analytics on encrypted
> > data. The analytic engine runs in a secured environment, but the data is
> > kept in untrusted storage. This could be public cloud storage or anything
> > else; the main requirement is to store and retrieve the data in encrypted
> > form only. The storage admin should never have the data key. The data
> > should be decrypted only at the end point (analytic engine), never in the
> > storage.
> >
> > This obviously impacts performance of Parquet selective reads. If a
> > Parquet file is bulk-encrypted in the storage, it becomes impossible to
> > extract its footer, retrieve a column subset, a few pages, etc. The file
> > must be fully delivered from storage to the engine location, decrypted
> > there, and then processed.
> > Moreover, even if the storage is trusted - it still has to fully decrypt
> > the file before parsing it and extracting select columns/pages.
> >
> > I've searched for available solutions to this problem, but haven't found
> > any (do let me know if I've missed anything!)
> >
> > So I have developed a basic Parquet implementation that performs separate
> > encryption of each header and page. It is fully functional, and allows
> > retrieving only the required data pieces, while keeping the Parquet file
> > encrypted in the storage. Actually, it doesn't require deep changes in
> > Parquet code, since it builds on the existing Thrift and compression
> > mechanisms. It's also not intrusive, in the sense that if encryption is
> > not used, the new code is bypassed with a number of 'ifs', so the existing
> > apps and tests continue to run unaffected.
> >
> > It's still raw code, not quite ready yet for upstreaming. Unless you guys
> > tell me this is pointless :), I'll start on preparing it for a pull
> > request.
> >
> >
> >
> > Regards,
> > Gidon
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
