changhu-m commented on issue #47435:
URL: https://github.com/apache/arrow/issues/47435#issuecomment-3237969586

   Thanks @adamreeve @rok .
   
   Our use case is the following: 
   
   Encryption side:
   
   * A parquet file is encrypted with a DEK.
   * The DEK is encrypted with a KEK.
   * The encrypted DEK and related metadata (i.e., the KMS info needed to get
     the KEK) are written as JSON to a metadata file.
   * The parquet file and metadata file are uploaded to a cloud location.
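
As a rough illustration of the metadata file described in the list above, here is a minimal Python sketch. The field names (`wrapped_dek`, `kms_instance_url`, `kek_id`) and the helper `build_key_metadata` are hypothetical; the actual layout of the file is not given here. The sketch assumes the DEK has already been encrypted with the KEK elsewhere.

```python
import base64
import json


def build_key_metadata(wrapped_dek: bytes, kms_instance_url: str, kek_id: str) -> str:
    """Serialize the encrypted DEK and the KMS lookup info as JSON.

    All field names are hypothetical placeholders, not a documented format.
    """
    return json.dumps(
        {
            # Binary key material is base64-encoded so it survives JSON.
            "wrapped_dek": base64.b64encode(wrapped_dek).decode("ascii"),
            "kms_instance_url": kms_instance_url,
            "kek_id": kek_id,
        },
        indent=2,
    )


# Placeholder bytes standing in for a DEK that was already encrypted with the KEK.
example = build_key_metadata(b"\x00" * 16, "https://kms.example.invalid", "kek-1")
```

The resulting string would be written next to the parquet file and uploaded with it.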
   
   Our C++ code, as mentioned, is: 
   
   ```c++
     std::string keyStr(reinterpret_cast<const char*>(key.data()), key.size());
   
     auto fileEncryptionProps =
         parquet::FileEncryptionProperties::Builder(keyStr)
             .algorithm(parquet::ParquetCipher::AES_GCM_V1)
             ->build();
   
     const auto props = parquet::WriterProperties::Builder()
                            .encryption(fileEncryptionProps)
                            ->build();
   
     const auto fields = convertSchema(table->schema());
     const auto schemaNode =
         std::static_pointer_cast<parquet::schema::GroupNode>(
             parquet::schema::GroupNode::Make(
                 "schema", parquet::Repetition::REQUIRED, fields));
     auto schema =
         std::static_pointer_cast<parquet::schema::GroupNode>(schemaNode);
   
     // Open output file
     std::shared_ptr<arrow::io::FileOutputStream> outFile;
     auto result = arrow::io::FileOutputStream::Open(outputFilePath);
     outFile = result.ValueOrDie();
   
     auto parquetStreamWriter = make_unique<parquet::StreamWriter>(
         parquet::ParquetFileWriter::Open(outFile, schema, props));
   
     return writeToStreamWriter(table, *(parquetStreamWriter.get()));
   ```
   
   Decryption side: 
   
   * Customers get the KEK using the metadata sent.
   * Customers decrypt the DEK.
   * Customers set up parquet decryption with the DEK.
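
The three customer-side steps above could be sketched roughly as follows. This is only an outline under assumptions: the JSON field names (`wrapped_dek`, `kms_instance_url`, `kek_id`) are hypothetical, and `fetch_kek` and `decrypt_dek` are caller-supplied stand-ins for the real KMS call and the real unwrapping cipher, neither of which is specified here.

```python
import base64
import json
from typing import Callable


def recover_dek(
    metadata_json: str,
    fetch_kek: Callable[[str, str], bytes],
    decrypt_dek: Callable[[bytes, bytes], bytes],
) -> bytes:
    """Return the plaintext DEK described by the metadata file.

    fetch_kek(kms_instance_url, kek_id) and decrypt_dek(wrapped_dek, kek)
    are hypothetical hooks supplied by the caller.
    """
    meta = json.loads(metadata_json)
    # Step 1: get the KEK using the KMS info from the metadata.
    kek = fetch_kek(meta["kms_instance_url"], meta["kek_id"])
    # Step 2: decrypt the DEK with the KEK.
    wrapped = base64.b64decode(meta["wrapped_dek"])
    # Step 3 (configuring parquet decryption with the result) is up to the caller.
    return decrypt_dek(wrapped, kek)
```

The returned bytes are what the reader-side code would then use as the DEK.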
   
   The Python code is: 
   
   ```Python
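   import pyarrow.parquet as pq
   import pyarrow.parquet.encryption as pe
   
   # Create a simple KMS client that returns our DEK.
   # `dek` is the DEK the customer has already decrypted (see the steps above).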
   class SimpleKmsClient(pe.KmsClient):
       def __init__(self):
           pe.KmsClient.__init__(self)
       
       def unwrap_key(self, wrapped_key, master_key_identifier):
           return dek
       
       def wrap_key(self, key_bytes, master_key_identifier):
           raise NotImplementedError("wrap_key not needed for decryption")
   
   # Create KMS factory
   def kms_factory(kms_connection_configuration):
       return SimpleKmsClient()
   
   crypto_factory = pe.CryptoFactory(kms_factory)
   
   # Simple decryption config
   decryption_config = pe.DecryptionConfiguration()
   kms_connection_config = pe.KmsConnectionConfig()
   
   # Create file decryption properties
   file_decryption_props = crypto_factory.file_decryption_properties(
       kms_connection_config, decryption_config
   )
   
   # Read the file
   table = pq.read_table(path, decryption_properties=file_decryption_props)
   
   ```
   
   
   And the Spark code is:
   
   ```Python
       
       from pyspark.sql import SparkSession
   
       # Create Spark session with encryption support
       # ({DEK} is a placeholder for the key value)
       spark = SparkSession.builder \
           .appName("ParquetModularEncryption") \
           .config('spark.jars', 'parquet-hadoop-1.13.1-tests.jar') \
           .config("spark.sql.adaptive.enabled", "true") \
           .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
           .config("spark.hadoop.parquet.encryption.kms.client.class",
                   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS") \
           .config("spark.hadoop.parquet.encryption.key.list", "k1:{DEK}") \
           .config("spark.hadoop.parquet.crypto.factory.class",
                   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory") \
           .getOrCreate()
   
   
       # Read the encrypted parquet file
       # (automatically decrypted with Hadoop config)
       df = spark.read.parquet(encrypted_path)
   
   ```
   
   Right now, with the existing Python API, I have to set some dummy metadata
   on the encryption side to satisfy it, so the encryption code becomes: 
   
   ```c++
     std::string keyStr(reinterpret_cast<const char*>(key.data()), key.size());
   
     // Set up key metadata
     folly::dynamic metadata =
         folly::dynamic::object("keyMaterialType", "PKMT1")
             ("internalStorage", true)
             ("isFooterKey", true)
             ("doubleWrapping", false)
             ("kmsInstanceID", "dummy_kms_instance_id")
             ("kmsInstanceURL", "dummy_kms_instance_url")
             ("masterKeyID", "dummy_master_key_id")
             ("wrappedDEK", "dummy_wrapped_dek");
   
     const std::string& footerKeyMetadata = folly::toJson(metadata);
   
     auto fileEncryptionProps =
         parquet::FileEncryptionProperties::Builder(keyStr)
             .algorithm(parquet::ParquetCipher::AES_GCM_V1)
             ->footer_key_metadata(footerKeyMetadata)
             ->build();
   
     const auto props = parquet::WriterProperties::Builder()
                            .encryption(fileEncryptionProps)
                            ->build();
   
     const auto fields = convertSchema(table->schema());
     const auto schemaNode =
         std::static_pointer_cast<parquet::schema::GroupNode>(
             parquet::schema::GroupNode::Make(
                 "schema", parquet::Repetition::REQUIRED, fields));
     auto schema =
         std::static_pointer_cast<parquet::schema::GroupNode>(schemaNode);
   
     // Open output file
     std::shared_ptr<arrow::io::FileOutputStream> outFile;
     auto result = arrow::io::FileOutputStream::Open(outputFilePath);
     outFile = result.ValueOrDie();
   
     auto parquetStreamWriter = make_unique<parquet::StreamWriter>(
         parquet::ParquetFileWriter::Open(outFile, schema, props));
   
     return writeToStreamWriter(table, *(parquetStreamWriter.get()));
   ```
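
For anyone who wants to reproduce the placeholder metadata without folly, the same JSON can be built in Python. The keys and values below are copied from the C++ snippet above; that every one of them is actually required by the reader is an assumption, and the variable names are only illustrative.

```python
import json

# Placeholder key material, mirroring the folly::dynamic object above.
# Only the structure matters here; the values are dummies.
placeholder_key_material = {
    "keyMaterialType": "PKMT1",
    "internalStorage": True,
    "isFooterKey": True,
    "doubleWrapping": False,
    "kmsInstanceID": "dummy_kms_instance_id",
    "kmsInstanceURL": "dummy_kms_instance_url",
    "masterKeyID": "dummy_master_key_id",
    "wrappedDEK": "dummy_wrapped_dek",
}

# The string that would be passed as the footer key metadata.
footer_key_metadata = json.dumps(placeholder_key_material)
```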
   
   
   Because of this workaround, we are not blocked.
   
   If we want to keep using
   `table = pq.read_table(path, decryption_properties=file_decryption_props)`,
   is there a way to indicate through `decryption_properties` that a DEK is
   already provided and that no further metadata is required? Right now, the
   key metadata must be parsed before the DEK can be retrieved.
   
   

