Hi all,
This week, 8 months after the first call for goals feedback and
requirements :), I got a new one - enabling old Parquet readers to access
data of unencrypted columns in encrypted files.
Better late than never. It actually doesn't sound unreasonable, and
deserves at least some consideration.
Let me describe the options (the way I see them). Any community feedback is
welcome.
But first, a little tech intro. Encrypted Parquet files can be created in
two modes - with an encrypted footer (let's call this 'EF' mode for the
purpose of this discussion), or with a plaintext footer ('PF' mode).
EF is significantly more secure - it protects all data and metadata in a
file, including the schema, number of rows, key-value properties, column
names, column sort order, list of encrypted columns and metadata of the
column encryption keys.
PF hides the data, but leaks all of these metadata fields. Moreover, EF
makes the footer tamper-proof, while PF doesn't.
The reason we have the PF option is to let users with relaxed security
requirements enable readers that don't have access to any keys to read
unencrypted columns in a file.
For encrypted columns, both EF and PF hide the ColumnMetaData - including
the min/max stats, number of values, data offset, data size and other
fields. Old Parquet readers obviously can't read EF files. But they also
can't read PF files - because old readers need access to the data offset and
size of every column in a file, even if they try to read just one column
(this is fixed in an encryption pull request).
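
To illustrate the problem, here is a toy Python sketch (not real Parquet
code). The ColumnChunk class and old_reader_read function are made up for
this mail; the only thing taken from the description above is the assumption
that an old reader resolves the offset and size of every column before it
reads any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunk:
    # Toy stand-in for a footer entry, not the real parquet.thrift structure.
    name: str
    data_offset: Optional[int]  # None models "hidden in encrypted ColumnMetaData"
    data_size: Optional[int]


def old_reader_read(footer, wanted):
    # Assumed old-reader behaviour: check offset and size of ALL columns
    # first, even though only one column is requested.
    for chunk in footer:
        if chunk.data_offset is None or chunk.data_size is None:
            raise ValueError(f"no offset/size for column {chunk.name!r}")
    chunk = next(c for c in footer if c.name == wanted)
    return (chunk.data_offset, chunk.data_size)


# A PF-mode file with one unencrypted and one encrypted column.
pf_footer = [
    ColumnChunk("id", 4, 1000),       # unencrypted, metadata visible
    ColumnChunk("ssn", None, None),   # encrypted, metadata hidden
]

try:
    old_reader_read(pf_footer, "id")
    outcome = "read ok"
except ValueError as err:
    outcome = f"failed: {err}"

print(outcome)
```

In this model, the request for the unencrypted "id" column fails because of
the hidden metadata of the "ssn" column.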
Now, the options:
1) Don't allow old Parquet readers to read encrypted files. Organizations
that start working with encrypted data will update their analytic
frameworks to use a Parquet version that supports encryption. This includes both
frameworks that write/read encrypted columns, and frameworks that work only
with unencrypted columns. The former and latter can technically be the same
framework, just different instances of it. The update can be done in one of
the following ways:
a. Upgrade Parquet version to the latest one, supporting encryption. This
might require some changes in framework code, unrelated to encryption.
b. Use the original old Parquet version, with encryption support added
(requires rebuilding the framework, no code changes). This is not hard, I'm
doing it for Parquet 1.8.2 in order to build and run Spark 2.3.0 with
encrypted data.
I think I can post this for 1.8.2 and other versions, with some help from
the community.
2) Replace PF with PF~, in order to allow old Parquet readers to read
unencrypted columns in encrypted files. PF~ is a little less secure and a
little less elegant version of PF. Less secure because it has to expose the
offset and size of encrypted column data. But actually it's not
catastrophic, and in any case, organizations with higher security
requirements will use the EF mode. Others can start with PF~ for a
transition period, and switch to EF later.
PF~ requires changing 2 lines in the parquet.thrift file, and a few dozen
lines in the implementation. I've played with this today, and it seems quite
feasible.
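
The trade-off between the modes can be summarized in a small Python sketch.
This is only a model of the description in this mail, not of the actual
thrift change; the field names and both functions are hypothetical, and the
"other fields" of ColumnMetaData are not modelled.

```python
# Fields of an encrypted column's ColumnMetaData that this mail names as
# hidden in PF mode (illustrative names, not thrift identifiers).
HIDDEN_IN_PF = {"min_max_stats", "num_values", "data_offset", "data_size"}

# Per the description of PF~, exactly these two become visible again.
EXPOSED_BY_PF_TILDE = {"data_offset", "data_size"}


def visible_without_keys(mode):
    """Fields of an encrypted column that a reader without keys can see."""
    if mode == "PF~":
        return HIDDEN_IN_PF & EXPOSED_BY_PF_TILDE
    if mode in ("EF", "PF"):
        return set()
    raise ValueError(f"unknown mode: {mode}")


def old_reader_can_read_plain_columns(mode):
    # EF: the footer itself is encrypted, so an old reader cannot parse it.
    if mode == "EF":
        return False
    # Otherwise an old reader needs offset and size of every column.
    return {"data_offset", "data_size"} <= visible_without_keys(mode)


for mode in ("EF", "PF", "PF~"):
    print(mode, sorted(visible_without_keys(mode)),
          old_reader_can_read_plain_columns(mode))
```

In this model, PF~ leaks two more fields than PF for each encrypted column,
and is the only mode in which an old reader gets what it needs.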
So, unless the community strongly favors option 1, I'm inclined to proceed
with 2; it should take up to a week to get the PRs submitted.
Cheers, Gidon.