jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1340143239
##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
cdef Expression _true = Expression._scalar(True)
-
ctypedef CParquetFileWriter* _CParquetFileWriterPtr
+IF PARQUET_ENCRYPTION_ENABLED:
+ cdef class ParquetEncryptionConfig(_Weakrefable):
+ """
+ Configuration for Parquet Encryption.
+
+ Parameters
+ ----------
+ crypto_factory : CryptoFactory
+ Factory for creating cryptographic instances.
+ kms_connection_config : KmsConnectionConfig
Review Comment:
```suggestion
crypto_factory : pyarrow.parquet.encryption.CryptoFactory
Factory for creating cryptographic instances.
kms_connection_config :
pyarrow.parquet.encryption.KmsConnectionConfig
```
##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -678,10 +843,14 @@ cdef class
ParquetFragmentScanOptions(FragmentScanOptions):
If not None, override the maximum total size of containers allocated
when decoding Thrift structures. The default limit should be
sufficient for most Parquet files.
+ decryption_config : ParquetDecryptionConfig, default None
Review Comment:
```suggestion
decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None
```
##########
python/pyarrow/includes/libarrow_parquet_readwrite.pxd:
##########
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
Review Comment:
I assume this is related to having the separate pxd file for encryption (see
comment below https://github.com/apache/arrow/pull/34616/files#r1317199494)?
If we don't split those in a separate file, it might not work to do the
```
from pyarrow.includes.libarrow_dataset_parquet cimport *
IF PARQUET_ENCRYPTION_ENABLED:
from pyarrow.includes.libarrow_parquet_readwrite_encryption cimport *
from pyarrow._parquet_encryption cimport *
ELSE:
from pyarrow.includes.libarrow_parquet_readwrite cimport *
```
Because otherwise if `CParquetFileWriteOptions` was also defined in
`libarrow_dataset_parquet.pxd`, importing it from
`libarrow_parquet_readwrite_encryption` would duplicate the imported
declaration (and not sure if cython would be happy about that)
##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -78,7 +239,8 @@ cdef class ParquetFileFormat(FileFormat):
CParquetFileFormat* parquet_format
def __init__(self, read_options=None,
- default_fragment_scan_options=None, **kwargs):
+ default_fragment_scan_options=None,
+ **kwargs):
Review Comment:
Small general comment: while there is nothing wrong with this change, trying
to avoid such stylistic changes (or other whitespace changes) does make it
easier to review the diff.
##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
cdef Expression _true = Expression._scalar(True)
-
ctypedef CParquetFileWriter* _CParquetFileWriterPtr
+IF PARQUET_ENCRYPTION_ENABLED:
+ cdef class ParquetEncryptionConfig(_Weakrefable):
+ """
+ Configuration for Parquet Encryption.
+
+ Parameters
+ ----------
+ crypto_factory : CryptoFactory
+ Factory for creating cryptographic instances.
+ kms_connection_config : KmsConnectionConfig
+ Configuration for connecting to Key Management Service.
+ encryption_config : EncryptionConfiguration
Review Comment:
```suggestion
encryption_config :
pyarrow.parquet.encryption.EncryptionConfiguration
```
This is long, but I think being explicit with the full name helps to
understand which objects are from the parquet module and which objects from the
dataset module (as that still confuses me in this PR, especially with the very
similar names EncryptionConfiguration vs ParquetEncryptionConfig)
(and the same for the ParquetDecryptionConfig docstring)
##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,65 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet::encryption {
+class CryptoFactory;
+struct KmsConnectionConfig;
+struct EncryptionConfiguration;
+struct DecryptionConfiguration;
+} // namespace parquet::encryption
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {
Review Comment:
Could you expand this doc comment a bit more? For example in the scan
options parameter above this is explained as "configuration structure that
provides encryption properties for a dataset", and in the python docs as
"Configuration for Parquet Encryption".
I also don't directly understand what "translates parameters" means in this
context (not being familiar with the encryption implementation, though)
(and then when expanding the explanation here, you can add the same in the
python docstring)
##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
cdef ArrowReaderProperties* arrow_reader_properties(self):
return self.parquet_options.arrow_reader_properties.get()
+ IF PARQUET_ENCRYPTION_ENABLED:
+ @property
+ def parquet_decryption_config(self):
Review Comment:
We could also always add the property, but let it raise NotImplementedError
when encryption is not enabled?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]