[spark] branch master updated: [SPARK-37956][DOCS] Add Python and Java examples of Parquet encryption in Spark SQL to documentation

srowen Sat, 21 May 2022 17:35:24 -0700

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 9f91b82b721 [SPARK-37956][DOCS] Add Python and Java examples of 
Parquet encryption in Spark SQL to documentation
9f91b82b721 is described below

commit 9f91b82b7210b95b7ac911ff65af735d1996b337
Author: Maya Anderson <[email protected]>
AuthorDate: Sat May 21 19:34:41 2022 -0500

    [SPARK-37956][DOCS] Add Python and Java examples of Parquet encryption in 
Spark SQL to documentation
    
    ### What changes were proposed in this pull request?
    Add Java and Python examples based on the Scala example for the use of 
Parquet encryption in Spark.
    
    ### Why are the changes needed?
    To provide information on how to use Parquet column encryption in Spark SQL 
in additional languages: in Python and Java, based on the Scala example.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, it adds Parquet encryption usage examples in Python and Java to the 
documentation, which currently provides only a scala example.
    
    ### How was this patch tested?
    SKIP_API=1 bundle exec jekyll build
    
    Python tab:
    
![image](https://user-images.githubusercontent.com/63074550/168066896-3078b861-e70c-4469-9712-5045a4b2d772.png)
    
    Java tab:
    
![image](https://user-images.githubusercontent.com/63074550/168051644-9c181f67-4baa-42e5-9a3c-57ac0a10b540.png)
    
    CC ggershinsky
    
    Closes #36523 from andersonm-ibm/pme_docs.
    
    Authored-by: Maya Anderson <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
---
 docs/sql-data-sources-parquet.md | 58 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md
index 0d97baf1e3a..0c3b5cc6156 100644
--- a/docs/sql-data-sources-parquet.md
+++ b/docs/sql-data-sources-parquet.md
@@ -259,6 +259,8 @@ Since Spark 3.2, columnar encryption is supported for 
Parquet tables with Apache
 
 Parquet uses the envelope encryption practice, where file parts are encrypted 
with "data encryption keys" (DEKs), and the DEKs are encrypted with "master 
encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each 
encrypted file/column. The MEKs are generated, stored and managed in a Key 
Management Service (KMS) of user’s choice. The Parquet Maven 
[repository](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/)
 has a jar with a mock KMS implementati [...]
 
+<div class="codetabs">
+
 <div data-lang="scala"  markdown="1">
 {% highlight scala %}
 
@@ -288,6 +290,62 @@ val df2 = 
spark.read.parquet("/path/to/table.parquet.encrypted")
 
 </div>
 
+<div data-lang="java"  markdown="1">
+{% highlight java %}
+
+sc.hadoopConfiguration().set("parquet.encryption.kms.client.class" ,
+   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
+
+// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
+sc.hadoopConfiguration().set("parquet.encryption.key.list" ,
+   "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA==");
+
+// Activate Parquet encryption, driven by Hadoop properties
+sc.hadoopConfiguration().set("parquet.crypto.factory.class" ,
+   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
+
+// Write encrypted dataframe files.
+// Column "square" will be protected with master key "keyA".
+// Parquet file footers will be protected with master key "keyB"
+squaresDF.write().
+   option("parquet.encryption.column.keys" , "keyA:square").
+   option("parquet.encryption.footer.key" , "keyB").
+   parquet("/path/to/table.parquet.encrypted");
+
+// Read encrypted dataframe files
+Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");
+
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python"  markdown="1">
+{% highlight python %}
+
+# Set hadoop configuration properties, e.g. using configuration properties of
+# the Spark job:
+# --conf spark.hadoop.parquet.encryption.kms.client.class=\
+#           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS"\
+# --conf spark.hadoop.parquet.encryption.key.list=\
+#           "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA=="\
+# --conf spark.hadoop.parquet.crypto.factory.class=\
+#           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory"
+
+# Write encrypted dataframe files.
+# Column "square" will be protected with master key "keyA".
+# Parquet file footers will be protected with master key "keyB"
+squaresDF.write\
+   .option("parquet.encryption.column.keys" , "keyA:square")\
+   .option("parquet.encryption.footer.key" , "keyB")\
+   .parquet("/path/to/table.parquet.encrypted")
+
+# Read encrypted dataframe files
+df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
+
+{% endhighlight %}
+
+</div>
+</div>
 
 #### KMS Client
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[spark] branch master updated: [SPARK-37956][DOCS] Add Python and Java examples of Parquet encryption in Spark SQL to documentation

Reply via email to