Parquet columnar encryption does support array types. Currently, however, it
requires the explicit full path of each column to be encrypted.
Your sample will work with
*spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
"k2:rider.list.element.foo,rider.list.element.bar")*
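
To illustrate the full-path requirement, here is a minimal sketch that walks a Spark schema and lists the dotted Parquet column paths, so they do not have to be typed by hand. The helper name `parquetLeafPaths` is hypothetical and not part of Spark or Parquet. The `list.element` and `key_value` segments assume Spark's default (non-legacy) Parquet layout. The schema is rebuilt by hand to mirror the sample.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: returns the dotted Parquet leaf-column paths of a Spark type.
// Assumes the default (non-legacy) layout written by Spark:
//   array -> <name>.list.element, map -> <name>.key_value.key / .value
def parquetLeafPaths(dt: DataType, prefix: String = ""): Seq[String] = dt match {
  case s: StructType =>
    s.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      parquetLeafPaths(f.dataType, path)
    }
  case a: ArrayType =>
    parquetLeafPaths(a.elementType, s"$prefix.list.element")
  case m: MapType =>
    parquetLeafPaths(m.keyType, s"$prefix.key_value.key") ++
      parquetLeafPaths(m.valueType, s"$prefix.key_value.value")
  case _ =>
    Seq(prefix) // primitive type: the accumulated path is a leaf column
}

// Same shape as the sample DataFrame (equivalent to df.schema).
val schema = StructType(Seq(
  StructField("foo", IntegerType),
  StructField("rider", ArrayType(StructType(Seq(
    StructField("foo", IntegerType),
    StructField("bar", IntegerType))))),
  StructField("ts", IntegerType),
  StructField("uuid", StringType)))

// Build the value for "parquet.encryption.column.keys" from all leaves under rider.
val columnKeys = parquetLeafPaths(schema)
  .filter(_.startsWith("rider."))
  .mkString("k2:", ",", "")
```

In a spark-shell session, `columnKeys` could then be passed to `spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys", columnKeys)` before writing.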

That said, a couple of things can be improved (thank you for running these
checks!):

- The exception text is not informative enough and doesn't help much in
correcting the parameters. I've opened a Jira for this (and for updating
the parameter documentation).
The goal is to make the exception print something like:
*Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
Encrypted column [rider] not in file schema column list: [foo] ,
[rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]*

- Configuring a key for all children of a nested schema node (e.g.
"k2:rider.*"). This has been discussed in the past, but not followed up.
Is this something you'd be interested in building? Alternatively, I can do it,
but it will take me a while to get to.
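
Until something like this exists in Parquet itself, the wildcard idea can be approximated on the client side. The sketch below is only an illustration: `expandColumnKeys` is a hypothetical helper, the `.*` suffix is not a syntax Parquet understands, and the list of leaf paths is assumed to be known already (here copied from the file schema column list above).

```scala
// Hypothetical client-side expansion of "keyId:prefix.*" entries into full paths.
// Input and output use the "keyId:col,col;keyId:col" format of
// parquet.encryption.column.keys; the ".*" suffix is an assumption of this sketch.
def expandColumnKeys(spec: String, leafPaths: Seq[String]): String =
  spec.split(";").toSeq.map(_.trim).filter(_.nonEmpty).map { entry =>
    val Array(keyId, cols) = entry.split(":", 2)
    val expanded = cols.split(",").toSeq.map(_.trim).flatMap {
      case c if c.endsWith(".*") =>
        val prefix = c.dropRight(1) // "rider.*" -> "rider."
        leafPaths.filter(_.startsWith(prefix))
      case c =>
        Seq(c) // already a full path, keep as-is
    }
    s"${keyId.trim}:${expanded.mkString(",")}"
  }.mkString(";")

// Leaf paths of the sample file, as listed in the proposed exception text.
val leaves = Seq("foo", "rider.list.element.foo", "rider.list.element.bar", "ts", "uuid")

val expandedKeys = expandColumnKeys("k2:rider.*", leaves)
```

The expanded string is what would be set as `parquet.encryption.column.keys`; entries without a `.*` suffix pass through unchanged.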


Cheers, Gidon


On Sat, Oct 29, 2022 at 12:45 AM nicolas paris <[email protected]>
wrote:

> Hello,
>
> Apparently, modular encryption does not yet support **array** types.
>
> ```scala
> spark.sparkContext.hadoopConfiguration.set("parquet.crypto.factory.class",
> "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.kms.client.class"
> , "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.key.list",
> "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.plaintext.footer",
> "true")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.footer.key",
> "k1")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
> "k2:rider")
>
> val df = spark.sql("select 1 as foo, array(named_struct('foo', 2, 'bar', 3)) as rider, 3 as ts, uuid() as uuid")
> df.write.format("parquet").mode("overwrite").save("/tmp/enc")
>
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
> Encrypted column [rider] not in file schema
>
> ```
>
> Also, the dotted column path would not support encrypting within nested
> structures mixed with arrays. For example, there is no way I am aware of to
> target "all foo in rider".
>
> ```
> root
>  |-- foo: integer (nullable = false)
>  |-- rider: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- foo: integer (nullable = false)
>  |    |    |-- bar: integer (nullable = false)
>  |-- ts: integer (nullable = false)
>  |-- uuid: string (nullable = false)
> ```
>
> So far, those two issues make arrays of confidential information
> impossible to encrypt, or am I missing something?
>
> Thanks,
>
