changhu-m commented on issue #47435: URL: https://github.com/apache/arrow/issues/47435#issuecomment-3237969586
Thanks @adamreeve @rok . Our use case is the following: Encryption side: * A parquet file is encrypted with a DEK. * The DEK is encrypted with KEK. * The encrypted DEK and related metadata (i.e., KMS info to get KEK) is printed as JSON in a metadata file. * The parquet file and metadata file are uploaded to a cloud location. Our C++ code, as mentioned, is: ```c++ std::string keyStr(reinterpret_cast<const char*>(key.data()), key.size()); auto fileEncryptionProps = parquet::FileEncryptionProperties::Builder(keyStr) .algorithm(parquet::ParquetCipher::AES_GCM_V1) ->build(); const auto props = parquet::WriterProperties::Builder() .encryption(fileEncryptionProps) ->build(); const auto fields = convertSchema(table->schema()); const auto schemaNode = std::static_pointer_cast<parquet::schema::GroupNode>( parquet::schema::GroupNode::Make( "schema", parquet::Repetition::REQUIRED, fields)); auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(schemaNode); // Open output file std::shared_ptr<arrow::io::FileOutputStream> outFile; auto result = arrow::io::FileOutputStream::Open(outputFilePath); outFile = result.ValueOrDie(); auto parquetStreamWriter = make_unique<parquet::StreamWriter>( parquet::ParquetFileWriter::Open(outFile, schema, props)); return writeToStreamWriter(table, *(parquetStreamWriter.get())); ``` Decryption side: * Customers get the KEK using the metadata sent. * Customers decrypt the DEK. * Customer sets up parquet decryption with the DEK. The Python code is: ```Python import pyarrow.parquet.encryption as pe # Create a simple KMS client that returns our DEK class SimpleKmsClient(pe.KmsClient): def __init__(self): pe.KmsClient.__init__(self) def unwrap_key(self, wrapped_key, master_key_identifier): return dek def wrap_key(self, key_bytes, master_key_identifier): raise NotImplementedError("wrap_key not needed for decryption") # Create KMS factory def kms_factory(kms_connection_configuration): return SimpleKmsClient() crypto_factory = pe.CryptoFactory(kms_factory) # Simple decryption config decryption_config = pe.DecryptionConfiguration() kms_connection_config = pe.KmsConnectionConfig() # Create file decryption properties file_decryption_props = crypto_factory.file_decryption_properties( kms_connection_config, decryption_config ) # Read the file table = pq.read_table(path, decryption_properties=file_decryption_props) ``` And the Spark code is: ```Python # Create Spark session with encryption support spark = SparkSession.builder \ .appName("ParquetModularEncryption") \ .config('spark.jars', 'parquet-hadoop-1.13.1-tests.jar')\ .config("spark.sql.adaptive.enabled", "true") \ .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \ .config("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS") \ .config("spark.hadoop.parquet.encryption.key.list", "k1:{DEK}") \ .config("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory") \ .getOrCreate() # Read the encrypted parquet file (automatically decrypted with Hadoop config) df = spark.read.parquet(encrypted_path) ``` Right now, with the existing Python API, I have to set some dummy metadata on the encryption side to satisfy it, so the encryption code becomes: ```c++ std::string keyStr(reinterpret_cast<const char*>(key.data()), key.size()); // Set up key metadata folly::dynamic metadata = folly::dynamic::object("keyMaterialType", "PKMT1")( "internalStorage", true)("isFooterKey", true)("doubleWrapping", false)( "kmsInstanceID", "dummy_kms_instance_id")( "kmsInstanceURL", "dummy_kms_instance_url")( "masterKeyID", "dummy_master_key_id")("wrappedDEK", "dummy_wrapped_dek"); const std::string& footerKeyMetadata = folly::toJson(metadata); auto fileEncryptionProps = parquet::FileEncryptionProperties::Builder(keyStr) .algorithm(parquet::ParquetCipher::AES_GCM_V1) ->footer_key_metadata(footerKeyMetadata) ->build(); const auto props = parquet::WriterProperties::Builder() .encryption(fileEncryptionProps) ->build(); const auto fields = convertSchema(table->schema()); const auto schemaNode = std::static_pointer_cast<parquet::schema::GroupNode>( parquet::schema::GroupNode::Make( "schema", parquet::Repetition::REQUIRED, fields)); auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(schemaNode); // Open output file std::shared_ptr<arrow::io::FileOutputStream> outFile; auto result = arrow::io::FileOutputStream::Open(outputFilePath); outFile = result.ValueOrDie(); auto parquetStreamWriter = make_unique<parquet::StreamWriter>( parquet::ParquetFileWriter::Open(outFile, schema, props)); return writeToStreamWriter(table, *(parquetStreamWriter.get())); ``` Because of this workaround, we are not blocked. If we want to keep using `table = pq.read_table(path, decryption_properties=file_decryption_props)`, is there a way to identify through `decryption_properties` that a DEK is already provided, and that no more metadata is required? Right now, metadata parsing is required before DEK retrieval. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org