Hi,

Sounds good. I've submitted a PR for this, and updated the Encryption.md PR.

Cheers, Gidon.

On Wed, Sep 19, 2018 at 9:54 AM Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> Sounds good to me. The filename extension could really help to prevent
> confusion.
>
> Br,
>
> Zoltan
>
> On Tue, Sep 18, 2018 at 4:35 PM Gidon Gershinsky <[email protected]> wrote:
>
> > Hi,
> >
> > 2 cents re the first point - the encrypted files will have an extension
> > "parquet.encrypted", which should help people understand the reason for
> > their error. They also should be aware that using old readers for
> > encrypted files is a temporary solution, the right thing to do is to
> > upgrade to a new Parquet version.
> > But I'm also ok with a truly incompatible format for the encrypted files.
> >
> > On Tue, Sep 18, 2018 at 5:07 PM Zoltan Ivanfi <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I'm a little bit worried that the misleading error message could lead
> > > to serious confusion. For this reason, I would slightly prefer a truly
> > > incompatible format for the encrypted files, but I don't have strong
> > > feelings against doing it the other way either.
> > >
> > > One idea that came to my mind (which could easily be a bad idea) is to
> > > write two metadata sections, one for new readers and one for older
> > > ones. The latter would not contain references to encrypted columns at
> > > all.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Tue, Sep 18, 2018 at 10:40 AM Gidon Gershinsky <[email protected]>
> > > wrote:
> > >
> > > > Hi Zoltan,
> > > >
> > > > Old readers, trying to access encrypted columns in PF~ files, get a
> > > > Thrift parsing exception, since they expect a plaintext PageHeader
> > > > structure at the page offset.
> > > > In encrypted columns, PageHeaders are encrypted with the column key.
> > > >
> > > > Old Parquet bindings in any language should be able to read plaintext
> > > > columns in PF~ files.
> > > >
> > > > Cheers, Gidon.
> > > >
> > > > On Tue, Sep 18, 2018 at 11:19 AM Zoltan Ivanfi <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Just to clarify: PF~ allows older readers to read data as long as
> > > > > they only try to access unencrypted columns. What happens when
> > > > > older readers do try to access encrypted columns?
> > > > >
> > > > > Also, by older readers do you specifically mean the current Java
> > > > > library or all existing language bindings?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Tue, Sep 18, 2018 at 9:45 AM Gidon Gershinsky <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > This week, 8 months after the first call for goals feedback and
> > > > > > requirements :), I got a new one - enabling old Parquet readers
> > > > > > to access data of unencrypted columns in encrypted files.
> > > > > > Better late than never.. But actually it doesn't sound
> > > > > > unreasonable, and deserves at least a consideration.
> > > > > >
> > > > > > Let me describe the options (the way I see them). Any community
> > > > > > feedback is welcome.
> > > > > >
> > > > > > But first, a little tech intro. Encrypted Parquet files can be
> > > > > > created in two modes - with an encrypted footer (let's call this
> > > > > > an 'EF' mode for the purpose of this discussion), or with a
> > > > > > plaintext footer ('PF' mode).
> > > > > > EF is significantly more secure - it protects all data and
> > > > > > metadata in a file, including the schema, number of rows,
> > > > > > key-value properties, column names, column sort order, list of
> > > > > > encrypted columns and metadata of the column encryption keys.
> > > > > > PF hides the data, but leaks all of these metadata fields.
> > > > > > Moreover, EF makes the footer tamper-proof, while PF doesn't.
> > > > > > The reason we have the PF option is to let users with relaxed
> > > > > > security requirements enable readers that don't have access to
> > > > > > any keys to read unencrypted columns in a file.
> > > > > >
> > > > > > For encrypted columns, both EF and PF hide the ColumnMetaData -
> > > > > > including the min/max stats, number of values, data offset, data
> > > > > > size and other fields. Old Parquet readers obviously can't read
> > > > > > EF files. But they also can't read PF files - because old readers
> > > > > > need access to the data offset and size of every column in a
> > > > > > file, even if they try to read just one column (this is fixed in
> > > > > > an encryption pull request).
> > > > > >
> > > > > > Now, the options:
> > > > > >
> > > > > > 1) Don't allow old Parquet readers to read encrypted files.
> > > > > > Organizations that start working with encrypted data will update
> > > > > > their analytic frameworks to use an encrypting Parquet version.
> > > > > > This includes both frameworks that write/read encrypted columns,
> > > > > > and frameworks that work only with unencrypted columns. The
> > > > > > former and latter can technically be the same framework, just
> > > > > > different instances of it. The update can be done in one of the
> > > > > > following ways:
> > > > > > a. Upgrade Parquet version to the latest one, supporting
> > > > > > encryption. This might require some changes in framework code,
> > > > > > unrelated to encryption.
> > > > > > b. Use the original old Parquet version, with an added encryption
> > > > > > support (requires rebuilding the framework, no code changes).
> > > > > > This is not hard, I'm doing it for Parquet 1.8.2 in order to
> > > > > > build and run Spark 2.3.0 with encrypted data.
> > > > > > I think I can post this for 1.8.2 and other versions, with some
> > > > > > help from the community.
> > > > > >
> > > > > > 2) Replace PF with PF~, in order to allow old Parquet readers to
> > > > > > read unencrypted columns in encrypted files. PF~ is a little less
> > > > > > secure and a little less elegant version of PF. Less secure
> > > > > > because it has to expose the offset and size of encrypted column
> > > > > > data. But actually it's not catastrophic, and in any case,
> > > > > > organizations with higher security requirements will use the EF
> > > > > > mode. Others can start with PF~ for a transition period, and
> > > > > > switch to EF later.
> > > > > > PF~ requires changing 2 lines in the parquet.thrift file, and a
> > > > > > few dozen lines in the implementation. I've played with this
> > > > > > today, seems quite feasible.
> > > > > > So, unless the community strongly favors option 1, I'm inclined
> > > > > > to proceed with 2, should take up to a week to get the PRs
> > > > > > submitted.
> > > > > >
> > > > > > Cheers, Gidon.
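
For readers skimming the thread, the compatibility rules described in the messages above can be summarized in a small Python sketch. This is a toy model and not Parquet code: the names Mode, Outcome and old_reader_outcome are hypothetical, and it assumes a file that contains at least one encrypted column. It only encodes what the thread states about readers that predate encryption support.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical labels for the three modes named in the thread.
    EF = "encrypted footer"
    PF = "plaintext footer"
    PF_TILDE = "plaintext footer with exposed offset and size (PF~)"


class Outcome(Enum):
    OK = "column data is readable"
    FILE_UNREADABLE = "reader cannot use the file at all"
    THRIFT_ERROR = "Thrift parsing exception at the page offset"


def old_reader_outcome(mode: Mode, column_encrypted: bool) -> Outcome:
    """What a pre-encryption reader runs into, per the thread.

    Assumes the file has at least one encrypted column.
    """
    if mode is Mode.EF:
        # The footer is encrypted, so an old reader cannot parse it.
        return Outcome.FILE_UNREADABLE
    if mode is Mode.PF:
        # Old readers need the data offset and size of every column,
        # and PF hides them for encrypted columns.
        return Outcome.FILE_UNREADABLE
    # PF~ exposes offset and size, so plaintext columns are readable.
    # An encrypted column has an encrypted PageHeader where the old
    # reader expects a plaintext one.
    return Outcome.THRIFT_ERROR if column_encrypted else Outcome.OK


for mode in Mode:
    for encrypted in (False, True):
        print(mode.name, encrypted, old_reader_outcome(mode, encrypted).name)
```

The loop prints one line per mode and column type, which makes the trade-off of option 2 visible: only PF~ gives old readers access to plaintext columns, at the price of exposing the offset and size of encrypted column data.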
