Hi Gidon,

A couple of questions on the proposed structs:

1. "kms_client_class": this sounds like it might be a very Java-centric approach. Have you given consideration to how this can be used in C++/Python? Should I just RTFD (which one)?
2. std::unordered_map<std::string, std::string> custom_kms_conf: is the uniqueness of keys intentional (i.e., why not std::vector<std::pair<>>)?

3. It seems like a slightly asymmetric API to take key-value pairs separately for custom_kms_conf while packing all column key metadata into a single serialized string. Is there a reason for this?

Thanks,
Micah

On Wed, Jul 8, 2020 at 11:06 PM Gidon Gershinsky <gg5...@gmail.com> wrote:

> Ok, so we had a look with Tham at the current pyarrow and parquet-cpp
> configuration objects. There is no Hadoop-like free map (this is good, I
> guess). Instead, the property keys are pre-defined in most objects.
>
> But some objects (such as HdfsConnectionConfig,
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L87)
> have a number of pre-defined keys - and a free string-to-string map,
> `extra_conf`. This approach is a good fit for us, because we build tools
> that allow working with different external KMSs (encryption Key Management
> Services). Each KMS requires a custom client that connects
> parquet encryption to the KMS server. We provide an interface for such
> clients; many properties are pre-defined, but the custom client
> implementations will require custom properties.
> We'll define configuration objects that will look like this:
>
> struct KmsConnectionConfig {
>   std::string kms_client_class;
>   std::string kms_instance_id;
>   std::string kms_instance_url;
>   std::string key_access_token;
>   std::unordered_map<std::string, std::string> custom_kms_conf;
> };
>
> struct EncryptionConfig {
>   std::string column_keys;
>   std::string footer_key;
>   std::string encryption_algorithm;
> };
>
> Cheers,
> Gidon
>
>
> ---------- Forwarded message ---------
> From: Gidon Gershinsky <gg5...@gmail.com>
> Date: Tue, Jul 7, 2020 at 9:35 AM
> Subject: Property-driven Parquet encryption
> To: dev <dev@arrow.apache.org>
> Cc: tham <t...@emotiv.com>
>
>
> Hi all,
>
> We are working on Parquet modular encryption, and are currently adding
> a high-level interface that allows encrypting/decrypting parquet files via
> properties only (without calling the low-level API). In the
> spark/parquet-mr domain, we're using the Hadoop configuration properties
> for that purpose - they are already passed from Spark to Parquet, and allow
> adding custom key-value properties that can carry the list of encrypted
> columns, key identities, etc., as described in
> https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing
>
> I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp
> ecosystem. Is there an analog of the Hadoop configuration (a free key-value
> map, passed all the way down to parquet-cpp)? Or a more structured
> configuration object (where we'd need to add the encryption-related
> properties)? All suggestions are welcome.
>
> Cheers,
> Gidon