calluw opened a new issue, #39444: URL: https://github.com/apache/arrow/issues/39444
### Describe the bug, including details regarding any error messages, version, and platform

Version: `pyarrow==14.0.2`

I have been trying out the new capability introduced in Arrow 14 that allows modular encryption of Parquet files to be used alongside the generic Dataset API, but began to hit segmentation faults in the C++ Arrow code when working with real, partitioned datasets rather than toy examples. The failure is most often a segmentation fault, but I have seen all of these:

```
Segmentation fault
OSError: Failed decryption finalization
OSError: Couldn't set AAD
```

After some trial and error generating larger and larger toy examples until the error was reliably hit, I discovered that the threshold is the number of rows *initially written* to the dataset on disk with modular encryption: in testing, `2^15` rows never faulted, but `2^15 + 1` rows often faults (occasionally it does not).

The backtrace when triggering the fault or error always ends in the same function:

```
#9  0x00007f9b726625f2 in parquet::encryption::AesDecryptor::AesDecryptorImpl::GcmDecrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) [clone .cold] () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#10 0x00007f9b7270923e in parquet::Decryptor::Decrypt(unsigned char const*, int, unsigned char*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#11 0x00007f9b726f569b in void parquet::ThriftDeserializer::DeserializeMessage<parquet::format::ColumnMetaData>(unsigned char const*, unsigned int*, parquet::format::ColumnMetaData*, parquet::Decryptor*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#12 0x00007f9b726f8f34 in parquet::ColumnChunkMetaData::ColumnChunkMetaDataImpl::ColumnChunkMetaDataImpl(parquet::format::ColumnChunk const*, parquet::ColumnDescriptor const*, short, short, parquet::ReaderProperties const&, parquet::ApplicationVersion const*, std::shared_ptr<parquet::InternalFileDecryptor>) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
```

The faulting frame corresponds to this source location: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/encryption_internal.cc#L453

I also found a previous fix in this area, whose associated issue describes exactly the same symptoms: https://github.com/apache/arrow/commit/88bccab18a4ab818355209e45862cc52f9cf4a0d

```
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() use ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which shouldn't be used from multiple threads concurrently.
```

This made me suspect a multi-threading issue during read or write.
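Since threading was the prime suspect, I tried disabling threads on both the write and the read path (the last observation in the list below). A minimal sketch of what that variant looks like, assuming the `table`, `path`, `file_format`, and `write_options` objects from the reproduction script further down (an excerpt, not a standalone program):

```python3
import pyarrow as pa
import pyarrow.dataset as ds

# Sketch only: assumes `table`, `path`, `file_format` and `write_options`
# are defined as in the reproduction script further down.
pa.set_cpu_count(1)  # extra precaution: shrink the global CPU thread pool too

# Write without threads...
ds.write_dataset(table, path, format=file_format,
                 file_options=write_options, use_threads=False)

# ...and read without threads: the fault still occurs.
dataset = ds.dataset(path, format=file_format)
new_table = dataset.to_table(use_threads=False)
```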
Whilst attempting to narrow down the cause (and whether the root cause occurs during writing or reading), I made the following observations:

- Writing `2^15 + 1` rows, deleting half the dataset, and then reading in the remaining dataset still encounters the error
- Writing `2^15 + 1` rows, and filtering to half the partitions in the dataset when reading, still encounters the error
- Writing `2^15 + 1` rows, then decrypting each individual Parquet file in the dataset with modular encryption and concatenating the results, never failed; only reading it as a full dataset did (see the sketch after the reproduction script below)
- Writing or reading (or both) with `use_threads=False` still encounters the error

The fact that the issue still occurs when the dataset on disk is halved and then read again suggested corruption during write, but the fact that every individual Parquet file remains independently readable suggests a problem with modular decryption during Dataset operations. That the error persists with threading disabled was unexpected.

The error is reproducible with any random data, and with a KMS client which does no encryption at all: it simply passes the keys around in (encoded) plaintext, eliminating our custom KMS client as a probable cause. It occurs whether or not the footer is plaintext, and whether or not envelope encryption is used.

Reproduction in Python below (no partitions are needed at all, so it produces a single Parquet file, which can be read normally on its own):

```python3
import base64
import tempfile

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pqe


class NoOpKmsClient(pqe.KmsClient):
    """KMS client that does no real encryption: keys are passed around base64-encoded."""

    def __init__(self):
        super().__init__()

    def wrap_key(self, key_bytes: bytes, _: str) -> bytes:
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key: bytes, _: str) -> bytes:
        return base64.b64decode(wrapped_key)


row_count = pow(2, 15) + 1  # 2^15 rows never fault; 2^15 + 1 usually does
table = pa.Table.from_arrays(
    [pa.array(np.random.rand(row_count), type=pa.float32())], names=["foo"]
)

kms_config = pqe.KmsConnectionConfig()
crypto_factory = pqe.CryptoFactory(lambda _: NoOpKmsClient())
encryption_config = pqe.EncryptionConfiguration(
    footer_key="UNIMPORTANT_KEY",
    column_keys={"UNIMPORTANT_KEY": ["foo"]},
    double_wrapping=True,
    plaintext_footer=False,
    data_key_length_bits=128,
)
pqe_config = ds.ParquetEncryptionConfig(crypto_factory, kms_config, encryption_config)
pqd_config = ds.ParquetDecryptionConfig(crypto_factory, kms_config, pqe.DecryptionConfiguration())

scan_options = ds.ParquetFragmentScanOptions(decryption_config=pqd_config)
file_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)
write_options = file_format.make_write_options(encryption_config=pqe_config)
file_decryption_properties = crypto_factory.file_decryption_properties(kms_config)

with tempfile.TemporaryDirectory() as tempdir:
    path = tempdir + "/test-dataset"
    ds.write_dataset(table, path, format=file_format, file_options=write_options)

    # Reading the single file directly succeeds
    file_path = path + "/part-0.parquet"
    new_table = pq.ParquetFile(file_path, decryption_properties=file_decryption_properties).read()
    assert table == new_table

    # Reading through the Dataset API faults (or raises a decryption error)
    dataset = ds.dataset(path, format=file_format)
    new_table = dataset.to_table()
    assert table == new_table
```
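For completeness, the per-file read that never fails (third observation in the list above) looks roughly like the sketch below. It assumes the flat, unpartitioned layout of the reproduction script and reuses its `path`, `crypto_factory`, and `kms_config`:

```python3
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch only: assumes `path`, `crypto_factory` and `kms_config` are defined
# as in the reproduction script above, and that all files sit directly in
# `path` (no partition subdirectories).
decryption_properties = crypto_factory.file_decryption_properties(kms_config)

pieces = []
for name in sorted(os.listdir(path)):
    if not name.endswith(".parquet"):
        continue
    # Decrypting each file individually never fails...
    piece = pq.ParquetFile(
        os.path.join(path, name), decryption_properties=decryption_properties
    ).read()
    pieces.append(piece)

# ...and neither does concatenating the results; only the Dataset read does.
combined = pa.concat_tables(pieces)
```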
Any help here would be much appreciated: being restricted to `2^15` rows is a roadblock to our use of this feature.

### Component(s)

Parquet, Python
