ggershinsky commented on a change in pull request #32895:
URL: https://github.com/apache/spark/pull/32895#discussion_r672359862
##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 </div>
+## Columnar Encryption
+
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that allows to run column encryption and decryption using a spark-shell only, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
+  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list" ,
+  "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class" ,
+  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB"
+squaresDF.write.
+  option("parquet.encryption.column.keys" , "keyA:square").
+  option("parquet.encryption.footer.key" , "keyB").
+parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+
+
+#### KMS Client

Review comment:
Mm, a good point. Thinking about this, I'm not aware of any other analytic framework where Parquet encryption is (or can be) activated this way. While this approach is meant to be general, it was designed and tested within Spark. In other frameworks, updating Parquet to 1.12.0 is not sufficient; they need to call low-level Parquet APIs to leverage the encryption feature.

Still, I agree it would be good to document the Parquet APIs (in general, not only encryption). However, there is a high chance that such documentation simply doesn't exist. At least, I couldn't find anything besides a page with the parquet-hadoop parameters.

Given these two points, I believe it is reasonable to add this section to the Spark documentation.
