Hi Gidon,

A couple of questions on the proposed structs:

1. "kms_client_class": this sounds like it might be a very Java-centric approach. Have you given consideration to how this can be used in C++/Python? Should I just RTFD (which one)?
2. std::unordered_map<std::string, std::string> custom_kms_conf: is the uniqueness of keys intentional (i.e., why not std::vector<std::pair<>>)?

3. It seems like a slightly asymmetric API to take key-value pairs separately for custom_kms_conf while packing all column key metadata into a single serialized string. Is there a reason for this?

Thanks,
Micah

On Wed, Jul 8, 2020 at 11:06 PM Gidon Gershinsky <gg5...@gmail.com> wrote:

> Ok, so we had a look with Tham at the current pyarrow and parquet-cpp
> configuration objects. There is no Hadoop-like free map (this is good, I
> guess). Instead, the property keys are pre-defined in most objects.
>
> But some objects (such as HdfsConnectionConfig,
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L87)
> have a number of pre-defined keys - and a free string-to-string map,
> `extra_conf`. This approach is a good fit for us, because we build tools
> that allow working with different external KMSs (encryption Key Management
> Services). Each KMS requires a custom client that connects
> parquet encryption to the KMS server. We provide an interface for such
> clients; many properties are pre-defined, but the custom client
> implementations will require custom properties.
> We'll define configuration objects that will look like this:
>
> struct KmsConnectionConfig {
>   std::string kms_client_class;
>   std::string kms_instance_id;
>   std::string kms_instance_url;
>   std::string key_access_token;
>   std::unordered_map<std::string, std::string> custom_kms_conf;
> };
>
> struct EncryptionConfig {
>   std::string column_keys;
>   std::string footer_key;
>   std::string encryption_algorithm;
> };
>
> Cheers,
> Gidon
>
>
> ---------- Forwarded message ---------
> From: Gidon Gershinsky <gg5...@gmail.com>
> Date: Tue, Jul 7, 2020 at 9:35 AM
> Subject: Property-driven Parquet encryption
> To: dev <dev@arrow.apache.org>
> Cc: tham <t...@emotiv.com>
>
>
> Hi all,
>
> We are working on Parquet modular encryption, and are currently adding
> a high-level interface that allows encrypting/decrypting parquet files via
> properties only (without calling the low-level API). In the
> spark/parquet-mr domain, we're using the Hadoop configuration properties
> for that purpose - they are already passed from Spark to Parquet, and allow
> adding custom key-value properties that can carry the list of encrypted
> columns, key identities, etc., as described in
> https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing
>
> I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp
> ecosystem. Is there an analog of the Hadoop configuration (a free key-value
> map, passed all the way down to parquet-cpp)? Or a more structured
> configuration object (where we'd need to add the encryption-related
> properties)? All suggestions are welcome.
>
> Cheers,
> Gidon