Parquet columnar encryption supports these types. Currently, it requires an
explicit full path for each column to be encrypted.
Your sample will work with

    spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
      "k2:rider.list.element.foo,rider.list.element.bar")

Having said that, there are a couple of things that can be improved (thank you
for running these checks!):

- The exception text is not informative enough and doesn't help much in
  correcting the parameters. I've opened a Jira for this (and for updating the
  parameter documentation). The goal is to make the exception print something
  like:
  Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
  Encrypted column [rider] not in file schema column list: [foo] ,
  [rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]
- Configuring a key for all children of a nested schema node (e.g.
  "k2:rider.*"). This had been discussed in the past, but not followed up.
  Is this something you'd be interested to build? Alternatively, I can do it,
  but this will take me a while to get to.

Cheers, Gidon

On Sat, Oct 29, 2022 at 12:45 AM nicolas paris <[email protected]> wrote:

> Hello,
>
> apparently, modular encryption does not yet support **arrays** types.
>
> ```scala
> spark.sparkContext.hadoopConfiguration.set("parquet.crypto.factory.class",
>   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.kms.client.class",
>   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.key.list",
>   "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.plaintext.footer",
>   "true")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.footer.key",
>   "k1")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
>   "k2:rider")
>
> val df = spark.sql("select 1 as foo, array(named_struct('foo',2, 'bar',3)) as rider, 3 as ts, uuid() as uuid")
> df.write.format("parquet").mode("overwrite").save("/tmp/enc")
>
> // Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
> // Encrypted column [rider] not in file schema
> ```
>
> also, the dotted column path would not support encrypting within nested
> structures mixed with arrays. For example, there is no way I am aware of to
> target "all foo in rider".
>
> ```
> root
>  |-- foo: integer (nullable = false)
>  |-- rider: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- foo: integer (nullable = false)
>  |    |    |-- bar: integer (nullable = false)
>  |-- ts: integer (nullable = false)
>  |-- uuid: string (nullable = false)
> ```
>
> so far, those two issues make arrays of confidential information
> impossible to encrypt, or am I missing something?
>
> Thanks,
>
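
Regarding the point in the reply that each encrypted column needs its explicit full path: until something like "k2:rider.*" exists, the paths can be derived from the DataFrame schema instead of being typed by hand. The following is only a rough sketch. Both helper names (`parquetLeafPaths`, `columnKeysFor`) are hypothetical and not part of Spark or Parquet, and the code assumes Spark's default (non-legacy) Parquet layout, where an array column appears as `<name>.list.element` and a map column as `<name>.key_value.key` / `<name>.key_value.value`.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not a Spark or Parquet API): lists the dotted Parquet
// leaf column paths for a Spark schema. Assumes Spark's default (non-legacy)
// Parquet writer layout for arrays and maps.
def parquetLeafPaths(dt: DataType, prefix: String = ""): Seq[String] = {
  // Appends a path segment to the path accumulated so far.
  def child(name: String): String = if (prefix.isEmpty) name else s"$prefix.$name"
  dt match {
    case s: StructType =>
      s.fields.toSeq.flatMap(f => parquetLeafPaths(f.dataType, child(f.name)))
    case a: ArrayType =>
      parquetLeafPaths(a.elementType, child("list.element"))
    case m: MapType =>
      parquetLeafPaths(m.keyType, child("key_value.key")) ++
        parquetLeafPaths(m.valueType, child("key_value.value"))
    case _ =>
      Seq(prefix) // non-nested type: the accumulated path is a leaf column
  }
}

// Hypothetical helper: builds one "parquet.encryption.column.keys" entry that
// assigns `keyId` to every leaf column under the top-level field `root`.
def columnKeysFor(schema: StructType, keyId: String, root: String): String = {
  val paths = parquetLeafPaths(schema)
    .filter(p => p == root || p.startsWith(root + "."))
  s"$keyId:${paths.mkString(",")}"
}
```

Pasted into spark-shell with the `df` from the quoted sample, `columnKeysFor(df.schema, "k2", "rider")` should produce the same string as in the reply, `k2:rider.list.element.foo,rider.list.element.bar`, which can then be passed to `hadoopConfiguration.set("parquet.encryption.column.keys", ...)`. It is worth comparing the generated paths against the actual file schema before relying on them.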
