I have posted a link to the design draft on the Jira. All comments are welcome. Thanks to Julien for his initial feedback and suggestions during our chat on my SF trip last week.
The design document is relatively detailed; at 8 pages, it's actually longer than the code used to implement it :). Still, the code can be split into multiple pull requests to enable a staged implementation of this mechanism.

Is there a Parquet call next week? I'd be glad to join and discuss this with the community.

Regards,
Gidon

From: Gidon Gershinsky <gg5...@gmail.com>
To: dev@parquet.apache.org, rb...@netflix.com
Date: 20/12/2017 08:13 PM
Subject: Re: Parquet modular encryption

Hi Ryan,

Certainly, I'd be glad to draft a design doc, sounds like a good idea. Could you assign me to PARQUET-1178? I'll pin the doc link there.

I've seen a brief discussion on creating an 'encrypting compressor', but indeed for data pages only. My implementation encrypts pages (data and dictionary), headers and the file footer. Also, I don't use a separate compressor for encryption; my code works with any compression supported in Parquet.

Regards,
Gidon.

On Wed, Dec 20, 2017 at 7:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> Hi Gidon,
>
> Thanks for working on this. People have talked about using this approach
> for page data in the past, but I haven't seen an implementation of it. You
> encrypt headers as well to make sure column stats are not stored in plain
> text?
>
> I think it would be helpful if you wrote up a small doc on your changes,
> like the design doc for column indexes
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1sBACp8Lbutuj1Zxdowvsrlm8ku4BF&d=DwIBaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=xR6HJBGHfjijqP-JgubSvA&m=IyTKfGy6KePzS2zPPfNzfq9G9ac88N5DimmeL5o20kc&s=KIzWAU6JFlspk28i50NbPvA8aUr8AZXGu2BqLNuUuE4&e=
> xf8U_Do5K2wSO4/edit#>.
> That way, we can discuss it in comments to make sure that you didn't miss
> any structures and validate the approach. Would you be willing to do that?
> Then we could figure out what we need to add to the Parquet spec to make
> this portable.
>
> Thanks!
>
> rb
>
> On Wed, Dec 20, 2017 at 7:46 AM, Gidon Gershinsky <gi...@il.ibm.com>
> wrote:
>
> > 'Hi' is missing in the message, due to a faulty copy/paste, sorry about
> > that :)
> >
> > And of course, all comments are most welcome.
> >
> > Regards,
> > Gidon
> >
> > From: Gidon Gershinsky <gg5...@gmail.com>
> > To: dev@parquet.apache.org
> > Date: 20/12/2017 05:33 PM
> > Subject: Parquet modular encryption
> >
> > We are working on frameworks that perform secure analytics on encrypted
> > data. The analytic engine runs in a secured environment, but the data is
> > kept in an untrusted storage. Could be a public cloud storage or anything
> > else - the main requirement is to store/retrieve the data in an encrypted
> > form only. The storage admin should never have the data key. The data
> > should be decrypted only at the end point (analytic engine), never in the
> > storage.
> >
> > This obviously impacts performance of Parquet selective reads. If a
> > Parquet file is bulk-encrypted in the storage, it becomes impossible to
> > extract its footer, retrieve a column subset, a few pages, etc. The file
> > must be fully delivered from storage to the engine location, decrypted
> > there, and then processed.
> > Moreover, even if the storage is trusted - it still has to fully decrypt
> > the file before parsing it and extracting select columns/pages.
> >
> > I've searched for available solutions to this problem, haven't found any
> > (but do let me know if I've missed anything!)
> >
> > So I have developed a basic Parquet implementation that performs separate
> > encryption of each header and page. It is fully functional, and allows to
> > retrieve only the required data pieces, while keeping the Parquet file
> > encrypted in the storage. Actually, it doesn't require deep changes in
> > Parquet code, since it builds on the existing Thrift and compression
> > mechanisms.
> > It's also not intrusive, in the sense that if encryption is not
> > used, the new code is bypassed with a number of 'ifs', so the existing
> > apps and tests continue to run unaffected.
> >
> > It's still raw code, not quite ready yet for upstreaming. Unless you guys
> > tell me this is pointless :), I'll start on preparing it for a pull
> > request.
> >
> > Regards,
> > Gidon
>
> --
> Ryan Blue
> Software Engineer
> Netflix
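
For readers following the thread, the core idea in the original message (encrypting each header and page separately, so that a reader can fetch and decrypt only the pieces it needs) can be sketched roughly as below. This is not the code from the proposal and does not use any Parquet API. The function names (`encrypt_module`, `write_file`, `read_module`), the module names, and the plain offset index are all hypothetical, and the cipher is a toy SHA-256 keystream standing in for a real authenticated cipher such as AES-GCM. It is not secure and is here only to keep the example dependency-free.

```python
import hashlib
import os
import struct
from typing import Dict, Tuple

NONCE_LEN = 12  # assumed nonce size for this sketch


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 over key, nonce, counter). Illustration only, NOT secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + struct.pack(">Q", counter)).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_module(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt one module (a header or a page) on its own, with a fresh nonce."""
    nonce = os.urandom(NONCE_LEN)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt_module(key: bytes, blob: bytes) -> bytes:
    """Reverse encrypt_module: split off the nonce, then XOR with the same keystream."""
    nonce, body = blob[:NONCE_LEN], blob[NONCE_LEN:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def write_file(key: bytes, modules: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate separately encrypted modules and record (offset, length) for each.

    In the design discussed in this thread the footer is encrypted as well;
    the index is kept in the clear here only to keep the sketch short.
    """
    data = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for name, plaintext in modules.items():
        blob = encrypt_module(key, plaintext)
        index[name] = (len(data), len(blob))
        data += blob
    return bytes(data), index


def read_module(key: bytes, data: bytes, index: Dict[str, Tuple[int, int]], name: str) -> bytes:
    """Fetch and decrypt a single module without touching the rest of the file."""
    offset, length = index[name]
    return decrypt_module(key, data[offset:offset + length])


# Hypothetical usage: two column pages, of which only one is read back.
demo_key = os.urandom(32)
demo_data, demo_index = write_file(
    demo_key,
    {"col_a/page_0": b"alpha values", "col_b/page_0": b"beta values"},
)
demo_page = read_module(demo_key, demo_data, demo_index, "col_b/page_0")
```

The point of the layout is that `read_module` slices out only one byte range, which is what lets a selective read work against untrusted storage: with a bulk-encrypted file, the whole byte string would have to be transferred and decrypted first.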