jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1340143239


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : CryptoFactory
+            Factory for creating cryptographic instances.
+        kms_connection_config : KmsConnectionConfig

Review Comment:
   ```suggestion
           crypto_factory : pyarrow.parquet.encryption.CryptoFactory
               Factory for creating cryptographic instances.
           kms_connection_config : 
pyarrow.parquet.encryption.KmsConnectionConfig
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -678,10 +843,14 @@ cdef class 
ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    decryption_config : ParquetDecryptionConfig, default None

Review Comment:
   ```suggestion
       decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None
   ```



##########
python/pyarrow/includes/libarrow_parquet_readwrite.pxd:
##########
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:

Review Comment:
   I assume this is related to having the separate pxd file for encryption (see 
comment below https://github.com/apache/arrow/pull/34616/files#r1317199494)? 
   If we don't split those in a separate file, it might not work to do the
   
   ```
   from pyarrow.includes.libarrow_dataset_parquet cimport *
   
   IF PARQUET_ENCRYPTION_ENABLED:
       from pyarrow.includes.libarrow_parquet_readwrite_encryption cimport *
       from pyarrow._parquet_encryption cimport *
   ELSE:
       from pyarrow.includes.libarrow_parquet_readwrite cimport *
   ```
   
   Because otherwise if `CParquetFileWriteOptions` was also defined in 
`libarrow_dataset_parquet.pxd`, importing it from 
`libarrow_parquet_readwrite_encryption` would duplicate the imported 
declaration (and not sure if cython would be happy about that)



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -78,7 +239,8 @@ cdef class ParquetFileFormat(FileFormat):
         CParquetFileFormat* parquet_format
 
     def __init__(self, read_options=None,
-                 default_fragment_scan_options=None, **kwargs):
+                 default_fragment_scan_options=None,
+                 **kwargs):

Review Comment:
   Small general comment: while there is nothing wrong with this change, trying 
to avoid such stylistic changes (or other whitespace changes) does make it 
easier to review the diff.



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : CryptoFactory
+            Factory for creating cryptographic instances.
+        kms_connection_config : KmsConnectionConfig
+            Configuration for connecting to Key Management Service.
+        encryption_config : EncryptionConfiguration

Review Comment:
   ```suggestion
           encryption_config : 
pyarrow.parquet.encryption.EncryptionConfiguration
   ```
   
   This is long, but I think being explicit with the full name helps to 
understand which objects are from the parquet module and which objects from the 
dataset module (as that still confuses me in this PR, especially with the very 
similar names EncryptionConfiguration vs ParquetEncryptionConfig)
   
   (and the same for the ParquetDecryptionConfig docstring)



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,65 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet::encryption {
+class CryptoFactory;
+struct KmsConnectionConfig;
+struct EncryptionConfiguration;
+struct DecryptionConfiguration;
+}  // namespace parquet::encryption
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {

Review Comment:
   Could you expand this doc comment a bit more? For example in the scan 
options parameter above this is explained as "configuration structure that 
provides encryption properties for a dataset", and in the python docs as 
"Configuration for Parquet Encryption".  
   I also don't directly understand what "translates parameters" means in this 
context (not being familiar with the encryption implementation, though)
   
   (and then when expanding the explanation here, you can add the same in the 
python docstring)



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   We could also always add the property, but let it raise NotImplementedError 
when encryption is not enabled?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to