ggershinsky commented on a change in pull request #32895: URL: https://github.com/apache/spark/pull/32895#discussion_r672232360
##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 </div>
+## Columnar Encryption
+
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that allows to run column encryption and decryption using a spark-shell only, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
+                           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list" ,
+                           "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class" ,
+                           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB"
+squaresDF.write.
+   option("parquet.encryption.column.keys" , "keyA:square").
+   option("parquet.encryption.footer.key" , "keyB").
+parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+
+
+#### KMS Client

Review comment:
To expand more on this point: removing this section would mean removing the previous section too, since it also uses the "Hello World" example. That would remove the full content of this pull request, which consists of these two sections.

I agree it would be reasonable to move some of the content to a Parquet documentation site, but no such site exists. AFAIK, Parquet doesn't have API documentation (it keeps only a page on its Hadoop parameters).

I realize this section/PR seems somewhat unusual compared to other Spark/Parquet doc sections, but there is a simple reason: encryption is somewhat unusual compared to other Parquet features. To be really useful, it requires more than just Hadoop parameters. It has an API or, more specifically, an interface for custom KMS client classes tailored to the user-specific KMS/IAM systems deployed in their organizations. Providing any such classes in the Parquet or Spark packages would be pointless, as detailed in other comments.

Therefore, the approach taken here is to provide a simple-to-understand "Hello World" KmsClient class that is also easy to experiment with (since it runs alone and doesn't require a real KMS server), followed by an explanation of how to take the next step and develop a real-life KmsClient. This should provide sufficient documentation for new adopters of the Spark/Parquet encryption capability, which is already used in numerous deployments.
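
For readers of this thread who want to see the envelope scheme from the quoted doc text in isolation, here is a minimal sketch using only JDK crypto. It is an illustration under stated assumptions, not Parquet's implementation: the object name `EnvelopeSketch`, its method names, and the use of AES-GCM for key wrapping are all hypothetical choices made here. The `wrapKey`/`unwrapKey` pair only loosely mirrors the role a custom KmsClient plays, and the master keys are the mock test keys from the snippet above, held in memory purely for demonstration.

```scala
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

// Hypothetical illustration only: not Parquet's code and not a real KMS.
object EnvelopeSketch {
  private val random = new SecureRandom()
  private val NonceLength = 12 // bytes; the customary AES-GCM nonce size
  private val TagBits = 128    // GCM authentication tag length in bits

  // Master encryption keys (MEKs), reusing the base64 test keys from the
  // quoted snippet. In a real deployment the MEKs never leave the KMS.
  private val masterKeys: Map[String, Array[Byte]] = Map(
    "keyA" -> Base64.getDecoder.decode("AAECAwQFBgcICQoLDA0ODw=="),
    "keyB" -> Base64.getDecoder.decode("AAECAAECAAECAAECAAECAA==")
  )

  // Encrypts with AES-GCM; the result is nonce followed by ciphertext and tag.
  def encrypt(key: Array[Byte], plaintext: Array[Byte]): Array[Byte] = {
    val nonce = new Array[Byte](NonceLength)
    random.nextBytes(nonce)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
      new GCMParameterSpec(TagBits, nonce))
    nonce ++ cipher.doFinal(plaintext)
  }

  // Reverses encrypt: splits off the nonce, then decrypts and verifies the tag.
  def decrypt(key: Array[Byte], blob: Array[Byte]): Array[Byte] = {
    val (nonce, body) = blob.splitAt(NonceLength)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
      new GCMParameterSpec(TagBits, nonce))
    cipher.doFinal(body)
  }

  // Generates a random 128-bit data encryption key (DEK).
  def newDek(): Array[Byte] = {
    val dek = new Array[Byte](16)
    random.nextBytes(dek)
    dek
  }

  // Wraps (encrypts) a DEK with the named MEK and returns it as base64 text.
  def wrapKey(dek: Array[Byte], masterKeyId: String): String =
    Base64.getEncoder.encodeToString(encrypt(masterKeys(masterKeyId), dek))

  // Unwraps a base64-encoded wrapped DEK with the named MEK.
  def unwrapKey(wrappedDek: String, masterKeyId: String): Array[Byte] =
    decrypt(masterKeys(masterKeyId), Base64.getDecoder.decode(wrappedDek))
}
```

A writer would call `EnvelopeSketch.newDek()`, encrypt the data with that DEK, and store only the output of `wrapKey(dek, "keyA")` next to the data. A reader would call `unwrapKey` with the same master key ID and then `decrypt`. A real-life client would replace the in-memory `masterKeys` map with calls to the organization's KMS, which is the step the review comment argues the docs need to explain.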
