Hi Ryan,

Certainly, I'd be glad to draft a design doc; it sounds like a good idea. Could you assign me to PARQUET-1178? I'll pin the doc link there.

I've seen a brief discussion on creating an 'encrypting compressor', but indeed for data pages only. My implementation encrypts pages (data and dictionary), headers and the file footer. Also, I don't use a separate compressor for encryption; my code works with any compression supported in Parquet.

Regards,
Gidon

On Wed, Dec 20, 2017 at 7:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Gidon,
>
> Thanks for working on this. People have talked about using this approach
> for page data in the past, but I haven't seen an implementation of it. You
> encrypt headers as well to make sure column stats are not stored in plain
> text?
>
> I think it would be helpful if you wrote up a small doc on your changes,
> like the design doc for column indexes
> <https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#>.
> That way, we can discuss it in comments to make sure that you didn't miss
> any structures and validate the approach. Would you be willing to do that?
> Then we could figure out what we need to add to the Parquet spec to make
> this portable.
>
> Thanks!
>
> rb
>
> On Wed, Dec 20, 2017 at 7:46 AM, Gidon Gershinsky <gi...@il.ibm.com>
> wrote:
>
> > 'Hi' is missing in the message, due to a faulty copy/paste, sorry about
> > that :)
> >
> > And of course, all comments are most welcome.
> >
> > Regards,
> > Gidon
> >
> > From: Gidon Gershinsky <gg5...@gmail.com>
> > To: dev@parquet.apache.org
> > Date: 20/12/2017 05:33 PM
> > Subject: Parquet modular encryption
> >
> > We are working on frameworks that perform secure analytics on encrypted
> > data. The analytic engine runs in a secured environment, but the data is
> > kept in untrusted storage. It could be public cloud storage or anything
> > else - the main requirement is to store/retrieve the data in an encrypted
> > form only. The storage admin should never have the data key. The data
> > should be decrypted only at the end point (analytic engine), never in the
> > storage.
> >
> > This obviously impacts the performance of Parquet selective reads. If a
> > Parquet file is bulk-encrypted in the storage, it becomes impossible to
> > extract its footer, retrieve a column subset, a few pages, etc. The file
> > must be fully delivered from storage to the engine location, decrypted
> > there, and then processed.
> > Moreover, even if the storage is trusted, it still has to fully decrypt
> > the file before parsing it and extracting select columns/pages.
> >
> > I've searched for available solutions to this problem, but haven't found
> > any (do let me know if I've missed anything!)
> >
> > So I have developed a basic Parquet implementation that performs separate
> > encryption of each header and page. It is fully functional, and allows
> > retrieving only the required data pieces, while keeping the Parquet file
> > encrypted in the storage. Actually, it doesn't require deep changes in
> > Parquet code, since it builds on the existing Thrift and compression
> > mechanisms. It's also not intrusive, in the sense that if encryption is
> > not used, the new code is bypassed with a number of 'ifs', so the
> > existing apps and tests continue to run unaffected.
> >
> > It's still raw code, not quite ready yet for upstreaming. Unless you guys
> > tell me this is pointless :), I'll start on preparing it for a pull
> > request.
> >
> > Regards,
> > Gidon
>
> --
> Ryan Blue
> Software Engineer
> Netflix
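
For readers following the thread, the idea of encrypting each header and page separately can be sketched roughly as below. This is only an illustration under stated assumptions, not the implementation discussed above: the function names (`encrypt_module`, `write_file`, `read_page`) are hypothetical, the cipher is a toy HMAC-SHA256 keystream standing in for a real authenticated cipher such as AES-GCM, and the (offset, length) index is returned in the clear, whereas the thread describes the footer as encrypted too. The point it shows is that when every page is compressed and encrypted on its own, a reader can fetch and decrypt a single page without touching the rest of the file.

```python
import hashlib
import hmac
import os
import zlib


def _keystream(key, nonce, length):
    """Toy keystream (HMAC-SHA256 in counter mode).

    ASSUMPTION: a stand-in for a real cipher such as AES-GCM; it provides
    no integrity protection and is for illustration only.
    """
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def encrypt_module(key, plaintext):
    """Compress, then encrypt one module (a header or a page) on its own."""
    compressed = zlib.compress(plaintext)
    nonce = os.urandom(12)  # fresh nonce per module
    stream = _keystream(key, nonce, len(compressed))
    return nonce + bytes(a ^ b for a, b in zip(compressed, stream))


def decrypt_module(key, blob):
    """Reverse encrypt_module: strip the nonce, decrypt, decompress."""
    nonce, body = blob[:12], blob[12:]
    stream = _keystream(key, nonce, len(body))
    return zlib.decompress(bytes(a ^ b for a, b in zip(body, stream)))


def write_file(key, pages):
    """Concatenate independently encrypted pages.

    Returns the file bytes and a list of (offset, length) entries, a
    simplified stand-in for the offsets a Parquet footer records.
    """
    buf = bytearray()
    index = []
    for page in pages:
        blob = encrypt_module(key, page)
        index.append((len(buf), len(blob)))
        buf += blob
    return bytes(buf), index


def read_page(key, data, index, i):
    """Fetch and decrypt only page i; other pages are never decrypted."""
    offset, length = index[i]
    return decrypt_module(key, data[offset:offset + length])


if __name__ == "__main__":
    key = os.urandom(32)
    pages = [b"page-0 values", b"page-1 values", b"page-2 values"]
    data, index = write_file(key, pages)
    print(read_page(key, data, index, 1))
```

Because compression happens before encryption inside `encrypt_module`, the sketch is independent of which compression codec is used, which mirrors the point made in the reply about not needing a dedicated 'encrypting compressor'.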