ggershinsky commented on a change in pull request #32895: URL: https://github.com/apache/spark/pull/32895#discussion_r650901713
##########
File path: docs/sql-data-sources-parquet.md
##########
@@ -252,6 +252,51 @@ REFRESH TABLE my_table;
 </div>
+## Columnar Encryption
+
+
+Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12.
+
+Parquet uses the envelope encryption practice, where the file parts are encrypted with “data encryption keys” (DEKs), and the DEKs are encrypted with “master encryption keys” (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. Parquet-test [package](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar) has a mock KMS implementation that allows to run column encryption and decryption without a KMS server:

Review comment:

There are many open source and public cloud KMS systems, and an even larger number of proprietary KMSs deployed on the platforms of different companies. The KMS APIs and protocols are also known to change rather frequently (surely faster than the Parquet release cycle). For these reasons, the Parquet community has decided not to release and support a client for any particular KMS. Instead, we've defined a plug-in KmsClient interface that can be used to implement a client for any public or private KMS system.

> the audience of this document includes not only a Spark developer but also a general end user.

That's a good point. I don't think that general end users without basic data security expertise should use Parquet encryption directly. I'll add text explaining that this document (and the API) is intended only for developers who are building a production-grade data security platform in their organization and who understand key management, user identity management, and access management.
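To make the plug-in idea concrete, here is a minimal sketch of what a KMS client boils down to in the envelope encryption scheme: wrap a DEK with a MEK, and unwrap it again. Everything in it is an assumption made for illustration. `SimpleKmsClient` is a hypothetical, simplified stand-in, not the actual Parquet KmsClient interface (the real one should be checked in parquet-hadoop), and `InMemoryKmsClient` is a toy that keeps master keys in a local map, which a real KMS client would never do.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical stand-in for a plug-in KMS client interface (NOT the real Parquet API).
interface SimpleKmsClient {
    // Encrypts ("wraps") a data encryption key with the named master key.
    String wrapKey(byte[] dekBytes, String masterKeyId) throws GeneralSecurityException;

    // Decrypts ("unwraps") a previously wrapped data encryption key.
    byte[] unwrapKey(String wrappedKey, String masterKeyId) throws GeneralSecurityException;
}

// Toy implementation for illustration only: master keys live in process memory.
class InMemoryKmsClient implements SimpleKmsClient {
    private static final int IV_LENGTH = 12;   // bytes, the usual GCM nonce size
    private static final int TAG_BITS = 128;   // GCM authentication tag length

    private final Map<String, byte[]> masterKeys = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    void addMasterKey(String id, byte[] key) {
        masterKeys.put(id, key.clone());
    }

    private SecretKeySpec lookup(String id) throws GeneralSecurityException {
        byte[] key = masterKeys.get(id);
        if (key == null) {
            throw new GeneralSecurityException("Unknown master key: " + id);
        }
        return new SecretKeySpec(key, "AES");
    }

    @Override
    public String wrapKey(byte[] dekBytes, String masterKeyId) throws GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, lookup(masterKeyId), new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(dekBytes);
        // Store the IV in front of the ciphertext so unwrapKey can recover it.
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return Base64.getEncoder().encodeToString(out);
    }

    @Override
    public byte[] unwrapKey(String wrappedKey, String masterKeyId) throws GeneralSecurityException {
        byte[] in = Base64.getDecoder().decode(wrappedKey);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, lookup(masterKeyId),
                new GCMParameterSpec(TAG_BITS, in, 0, IV_LENGTH));
        return cipher.doFinal(in, IV_LENGTH, in.length - IV_LENGTH);
    }
}

class KmsDemo {
    public static void main(String[] args) throws Exception {
        SecureRandom random = new SecureRandom();
        byte[] mek = new byte[16];
        byte[] dek = new byte[16];
        random.nextBytes(mek);
        random.nextBytes(dek);

        InMemoryKmsClient kms = new InMemoryKmsClient();
        kms.addMasterKey("mek1", mek);

        String wrapped = kms.wrapKey(dek, "mek1");
        byte[] unwrapped = kms.unwrapKey(wrapped, "mek1");
        System.out.println(Arrays.equals(dek, unwrapped)); // prints "true"
    }
}
```

A client for a real KMS would keep the same two operations but delegate them to the KMS server, so the master keys never leave it. That is why the interface can stay stable while KMS APIs and protocols change.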
