ggershinsky commented on a change in pull request #32895:
URL: https://github.com/apache/spark/pull/32895#discussion_r672232360



##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 
 </div>
 
+## Columnar Encryption
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that makes it possible to run column encryption and decryption with just a spark-shell, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala"  markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
+                           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list",
+                           "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class",
+                           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB".
+squaresDF.write.
+  option("parquet.encryption.column.keys", "keyA:square").
+  option("parquet.encryption.footer.key", "keyB").
+  parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+
+#### KMS Client

Review comment:
       To expand on this point: removing this section would mean removing the previous section too, since it also uses the "Hello World" example. That would remove the full content of this pull request, which comprises these two sections.
   I agree it would be reasonable to move some of the content to a Parquet documentation site, but such a site doesn't exist; AFAIK Parquet has no API documentation (only a page on its Hadoop parameters).
   
   I realize this section/PR may seem unusual compared to other Spark/Parquet doc sections, but there is a simple reason: encryption is unusual compared to other Parquet features. To be really useful, it requires more than just Hadoop parameters. It has an API - more specifically, an interface for custom KMS client classes tailored to the user-specific KMS/IAM systems deployed in their organizations. Shipping any such classes in Parquet or Spark packages would be pointless, as detailed in other comments.
   
   Therefore, the approach taken here is to provide a simple-to-understand "Hello World" KmsClient class that is easy to experiment with (since it runs standalone and doesn't require a real KMS server), followed by an explanation of how to take the next step and develop a real-life KmsClient. This should provide sufficient documentation for new adopters of the Spark/Parquet encryption capability, which is already used in numerous deployments.
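
   To make that "next step" more concrete, a minimal sketch of such a user-built class could look like the following (illustration only, not part of the PR: `MyOrgKmsClient` and `MyOrgKmsHttpApi` are hypothetical names, and the REST helper is stubbed out; the interface is the parquet-mr `KmsClient` quoted in this section):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.KmsClient

// Hypothetical in-house REST helper, stubbed out for this sketch; a real one
// would call the organization's KMS wrap/unwrap endpoints over TLS.
object MyOrgKmsHttpApi {
  def wrap(url: String, token: String, masterKeyId: String, dek: Array[Byte]): String = ???
  def unwrap(url: String, token: String, masterKeyId: String, wrappedDek: String): Array[Byte] = ???
}

class MyOrgKmsClient extends KmsClient {
  private var kmsUrl: String = _
  private var token: String = _

  override def initialize(configuration: Configuration, kmsInstanceID: String,
                          kmsInstanceURL: String, accessToken: String): Unit = {
    kmsUrl = kmsInstanceURL
    token = accessToken
  }

  // Encrypt (wrap) the DEK with the master key - performed inside the KMS
  // server, so the master key never leaves it.
  override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String =
    MyOrgKmsHttpApi.wrap(kmsUrl, token, masterKeyIdentifier, keyBytes)

  // Decrypt (unwrap) the DEK with the master key, again inside the KMS server.
  override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] =
    MyOrgKmsHttpApi.unwrap(kmsUrl, token, masterKeyIdentifier, wrappedKey)
}
```

   Such a class would then be activated exactly like the mock one in the sample, e.g. `sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "com.myorg.MyOrgKmsClient")` (class name hypothetical).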

##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 
 </div>
 
+## Columnar Encryption
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that makes it possible to run column encryption and decryption with just a spark-shell, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala"  markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
+                           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list",
+                           "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class",
+                           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB".
+squaresDF.write.
+  option("parquet.encryption.column.keys", "keyA:square").
+  option("parquet.encryption.footer.key", "keyB").
+  parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+
+#### KMS Client
+
+The InMemoryKMS class is provided only for illustration and simple demonstration of Parquet encryption functionality. **It should not be used in a real deployment**. The master encryption keys must be kept and managed in a production-grade KMS system, deployed in the user's organization. Rollout of Spark with Parquet encryption requires implementation of a client class for the KMS server. Parquet provides a plug-in [interface](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KmsClient.java) for development of such classes:
+
+<div data-lang="java"  markdown="1">
+{% highlight java %}
+
+public interface KmsClient {
+  // Wraps a key - encrypts it with the master key.
+  public String wrapKey(byte[] keyBytes, String masterKeyIdentifier);
+
+  // Decrypts (unwraps) a key with the master key.
+  public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
+
+  // Use of initialization parameters is optional.
+  public void initialize(Configuration configuration, String kmsInstanceID,
+                         String kmsInstanceURL, String accessToken);
+}
+
+{% endhighlight %}
+
+</div>
+
+An [example](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java) of such a class for an open source [KMS](https://www.vaultproject.io/api/secret/transit) can be found in the parquet-mr repository. The production KMS client should be designed in cooperation with the organization's security administrators, and built by developers with experience in access control management. Once such a class is created, it can be passed to applications via the `parquet.encryption.kms.client.class` parameter and leveraged by general Spark users as shown in the encrypted dataframe write/read sample above.

Review comment:
       @srowen would one of these two options be OK for you? (1. removing the URLs but keeping the names of the server/client; or 2. keeping the URLs, pointing at the latest versions of the server/client)

##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 
 </div>
 
+## Columnar Encryption
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that makes it possible to run column encryption and decryption with just a spark-shell, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala"  markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
+                           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list",
+                           "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class",
+                           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB".
+squaresDF.write.
+  option("parquet.encryption.column.keys", "keyA:square").
+  option("parquet.encryption.footer.key", "keyB").
+  parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+
+#### KMS Client
+
+The InMemoryKMS class is provided only for illustration and simple demonstration of Parquet encryption functionality. **It should not be used in a real deployment**. The master encryption keys must be kept and managed in a production-grade KMS system, deployed in the user's organization. Rollout of Spark with Parquet encryption requires implementation of a client class for the KMS server. Parquet provides a plug-in [interface](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KmsClient.java) for development of such classes:
+
+<div data-lang="java"  markdown="1">
+{% highlight java %}
+
+public interface KmsClient {
+  // Wraps a key - encrypts it with the master key.
+  public String wrapKey(byte[] keyBytes, String masterKeyIdentifier);
+
+  // Decrypts (unwraps) a key with the master key.
+  public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
+
+  // Use of initialization parameters is optional.
+  public void initialize(Configuration configuration, String kmsInstanceID,
+                         String kmsInstanceURL, String accessToken);
+}
+
+{% endhighlight %}
+
+</div>
+
+An [example](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java) of such a class for an open source [KMS](https://www.vaultproject.io/api/secret/transit) can be found in the parquet-mr repository. The production KMS client should be designed in cooperation with the organization's security administrators, and built by developers with experience in access control management. Once such a class is created, it can be passed to applications via the `parquet.encryption.kms.client.class` parameter and leveraged by general Spark users as shown in the encrypted dataframe write/read sample above.

Review comment:
       SGTM

##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,71 @@ REFRESH TABLE my_table;
 
 </div>
 
+## Columnar Encryption
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
+
+Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice. The Parquet Maven [repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that makes it possible to run column encryption and decryption with just a spark-shell, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
+
+<div data-lang="scala"  markdown="1">
+{% highlight scala %}
+
+sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
+                           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration.set("parquet.encryption.key.list",
+                           "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration.set("parquet.crypto.factory.class",
+                           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB".
+squaresDF.write.
+  option("parquet.encryption.column.keys", "keyA:square").
+  option("parquet.encryption.footer.key", "keyB").
+  parquet("/path/to/table.parquet.encrypted")
+
+// Read encrypted dataframe files
+val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+
+#### KMS Client

Review comment:
       Mm, a good point. Thinking about this, I'm not aware of any other analytic framework where Parquet encryption is (or can be) activated this way. While this approach is meant to be general, it was designed and tested within Spark. In other frameworks, updating Parquet to 1.12.0 is not sufficient; they need to call low-level Parquet APIs to leverage the encryption feature.
   
   Still, I agree it would be good to document the Parquet APIs (in general, not only encryption). However, there is a high chance such documentation simply doesn't exist; at least I couldn't find anything besides a page with the parquet-hadoop parameters.
   
   Given these two points, I believe it is reasonable to add this section to the Spark documentation.
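
   For illustration, a rough sketch of that low-level path follows. This is hedged: the builder names are as I recall them from parquet-hadoop 1.12 and should be checked against the actual release, and the hardcoded keys are placeholders; a real application would obtain keys from a KMS.

```scala
import java.util.Collections
import org.apache.parquet.crypto.{ColumnEncryptionProperties, FileEncryptionProperties}
import org.apache.parquet.hadoop.metadata.ColumnPath

// Placeholder 128-bit keys, hardcoded only for this sketch.
val footerKey: Array[Byte] = "0123456789012345".getBytes("UTF-8")
val columnKey: Array[Byte] = "1234567890123456".getBytes("UTF-8")

// Encrypt the "square" column with its own key; unlisted columns stay plaintext.
val columnProps = ColumnEncryptionProperties.builder("square")
  .withKey(columnKey)
  .build()

// Protect the file footer with a separate key.
val fileEncryptionProps = FileEncryptionProperties.builder(footerKey)
  .withEncryptedColumns(Collections.singletonMap(ColumnPath.get("square"), columnProps))
  .build()

// A framework then passes fileEncryptionProps to its ParquetWriter when
// creating each file. In Spark, PropertiesDrivenCryptoFactory derives these
// properties from the Hadoop parameters shown in the doc section, so end
// users never touch this API.
```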



