ggershinsky commented on a change in pull request #32895:
URL: https://github.com/apache/spark/pull/32895#discussion_r669693709
##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 </div>
 
+## Columnar Encryption
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that allows running column encryption and decryption with a spark-shell only, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
+  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list",
+  "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class",
+  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB".
+squaresDF.write.
+  option("parquet.encryption.column.keys", "keyA:square").
+  option("parquet.encryption.footer.key", "keyB").
+  parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+</div>
+
+#### KMS Client
+
+The InMemoryKMS class is provided only for illustration and simple demonstration of Parquet encryption functionality. **It should not be used in a real deployment**. The master encryption keys must be kept and managed in a production-grade KMS system, deployed in the user's organization. Rolling out Spark with Parquet encryption requires implementation of a client class for the KMS server. Parquet provides a plug-in [interface](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KmsClient.java) for development of such classes:
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+
+public interface KmsClient {
+  // Wraps a key - encrypts it with the master key.
+  public String wrapKey(byte[] keyBytes, String masterKeyIdentifier);
+
+  // Decrypts (unwraps) a key with the master key.
+  public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
+
+  // Use of initialization parameters is optional.
+  public void initialize(Configuration configuration, String kmsInstanceID,
+      String kmsInstanceURL, String accessToken);
+}
+
+{% endhighlight %}
+</div>
+
+An [example](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java) of such a class for an open source [KMS](https://www.vaultproject.io/api/secret/transit) can be found in the parquet-mr repository.
+The production KMS client should be designed in cooperation with the organization's security administrators, and built by developers with experience in access control management. Once such a class is created, it can be passed to applications via the `parquet.encryption.kms.client.class` parameter and leveraged by general Spark users as shown in the encrypted dataframe write/read sample above.
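For orientation, a toy client along the lines of this interface could look like the sketch below. It is only a sketch: the `ToyKmsClient` name, the `example.kms.key.list` property, its `id:base64` value format and the local AES-GCM wrapping are illustrative assumptions and not part of the Parquet API; a production client would delegate the wrap/unwrap calls to the KMS server rather than hold master keys in memory.

```scala
import java.security.SecureRandom
import java.util.Base64

import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.KmsClient

// Hypothetical toy client: master keys are read from a Hadoop property instead of
// being fetched from a KMS server. Everything beyond the KmsClient method
// signatures (property name, key format, local wrapping) is illustrative only.
class ToyKmsClient extends KmsClient {

  private var masterKeys: Map[String, Array[Byte]] = Map.empty

  override def initialize(configuration: Configuration, kmsInstanceID: String,
      kmsInstanceURL: String, accessToken: String): Unit = {
    // Assumed format of the (made-up) property: "keyA:<base64>, keyB:<base64>"
    masterKeys = configuration.getTrimmedStrings("example.kms.key.list").map { entry =>
      val Array(id, encoded) = entry.split(":", 2)
      id -> Base64.getDecoder.decode(encoded)
    }.toMap
  }

  // "Wrapping" a DEK = encrypting its bytes with the master key. A real client
  // would send keyBytes to the KMS and never see the master key material.
  override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String = {
    val iv = new Array[Byte](12)
    new SecureRandom().nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE,
      new SecretKeySpec(masterKeys(masterKeyIdentifier), "AES"),
      new GCMParameterSpec(128, iv))
    // Prepend the IV so unwrapKey can reconstruct the cipher parameters
    Base64.getEncoder.encodeToString(iv ++ cipher.doFinal(keyBytes))
  }

  // Unwrapping reverses the operation with the same master key.
  override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] = {
    val bytes = Base64.getDecoder.decode(wrappedKey)
    val (iv, ciphertext) = bytes.splitAt(12)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE,
      new SecretKeySpec(masterKeys(masterKeyIdentifier), "AES"),
      new GCMParameterSpec(128, iv))
    cipher.doFinal(ciphertext)
  }
}
```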
Review comment:
Looks like I've misunderstood. I thought you were referring to these hyperlinks:
- https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
  For the latest version, this can be replaced with a link to the master branch: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
- https://www.vaultproject.io/api/secret/transit
  I think this is their latest version.

As for adding this information to https://parquet.apache.org/documentation/latest/ - there are a number of problems with that:
- the page doesn't keep this kind of technical detail, on encryption or other Parquet features;
- the page is a few years old and doesn't mention the recent Parquet features;
- I don't have edit rights to it, and updating the page (if possible at all) would be a long project in the community, with no chance of being ready in time for the Spark 3.2.0 release.

This is about hyperlinks to a sample KMS client/server. Another option would be simply to remove the URLs and leave only the names (Hashicorp Vault; Parquet VaultClient); this should be enough for developers to find them.
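For completeness, wiring any such client into a spark-shell session follows the same pattern as the InMemoryKMS example quoted above. The sketch below assumes the Vault sample class (name inferred from the file path in the links, to be verified against the source) is on the classpath, and that `keyA`/`keyB` exist as master key identifiers in the KMS:

```scala
// Point Parquet at the custom KMS client class instead of the mock InMemoryKMS
// (class name inferred from the VaultClient sample path above).
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.samples.VaultClient")

// The properties-driven crypto factory, as in the documentation example.
sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

// Any KMS-specific settings (server URL, access token, ...) that the client's
// initialize() method expects would be configured here as well.

// Writing and reading are unchanged from the documentation example.
squaresDF.write.
  option("parquet.encryption.column.keys", "keyA:square").
  option("parquet.encryption.footer.key", "keyB").
  parquet("/path/to/table.parquet.encrypted")
```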
