calluw opened a new issue, #39444: URL: https://github.com/apache/arrow/issues/39444
### Describe the bug, including details regarding any error messages, version, and platform

Version: `pyarrow==14.0.2`

I have been trying out the new capability introduced in Arrow 14 that allows modular encryption of Parquet files to be used alongside the generic Dataset API, but began to hit segmentation faults in the C++ Arrow code when working with real, partitioned datasets rather than toy examples. The failure is most often a segmentation fault, but I have seen all of these:

```
Segmentation fault
OSError: Failed decryption finalization
OSError: Couldn't set AAD
```

After some trial and error generating larger and larger toy examples until the error was reliably hit, I discovered that the threshold is the number of rows *initially written* to the dataset on disk with modular encryption: in testing, `2^15` rows never faulted, but `2^15 + 1` rows often faults (occasionally it does not).

The backtrace when triggering the fault or error always ends in the same function:

```
#9  0x00007f9b726625f2 in parquet::encryption::AesDecryptor::AesDecryptorImpl::GcmDecrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) [clone .cold] () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#10 0x00007f9b7270923e in parquet::Decryptor::Decrypt(unsigned char const*, int, unsigned char*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#11 0x00007f9b726f569b in void parquet::ThriftDeserializer::DeserializeMessage<parquet::format::ColumnMetaData>(unsigned char const*, unsigned int*, parquet::format::ColumnMetaData*, parquet::Decryptor*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#12 0x00007f9b726f8f34 in parquet::ColumnChunkMetaData::ColumnChunkMetaDataImpl::ColumnChunkMetaDataImpl(parquet::format::ColumnChunk const*, parquet::ColumnDescriptor const*, short, short, parquet::ReaderProperties const&, parquet::ApplicationVersion const*, std::shared_ptr<parquet::InternalFileDecryptor>) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
```

The faulting frame corresponds to this source location: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/encryption_internal.cc#L453

I also found a previous fix in this area, whose associated issue describes exactly the same symptoms: https://github.com/apache/arrow/commit/88bccab18a4ab818355209e45862cc52f9cf4a0d

```
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() use ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which shouldn't be used from multiple threads concurrently.
```

This made me suspect a multi-threading issue during read or write.
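Since threading was the prime suspect, I tried disabling threads on both the write and the read path (the last observation in the list below). A minimal sketch of what that variant looks like, assuming the `table`, `path`, `file_format`, and `write_options` objects from the reproduction script further down (an excerpt, not a standalone program):

```python3
import pyarrow as pa
import pyarrow.dataset as ds

# Sketch only: assumes `table`, `path`, `file_format` and `write_options`
# are defined as in the reproduction script further down.
pa.set_cpu_count(1)  # extra precaution: shrink the global CPU thread pool too

# Write without threads...
ds.write_dataset(table, path, format=file_format,
                 file_options=write_options, use_threads=False)

# ...and read without threads: the fault still occurs.
dataset = ds.dataset(path, format=file_format)
new_table = dataset.to_table(use_threads=False)
```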
Whilst attempting to narrow down the cause (and whether the root cause occurs during writing or reading), I made the following observations:

- Writing `2^15 + 1` rows, deleting half the dataset, and then reading in the remaining dataset still encounters the error
- Writing `2^15 + 1` rows, and filtering to half the partitions in the dataset when reading, still encounters the error
- Writing `2^15 + 1` rows, then decrypting each individual Parquet file in the dataset with modular encryption and concatenating the results, never failed; only reading it as a full dataset did (see the sketch after the reproduction script below)
- Writing or reading (or both) with `use_threads=False` still encounters the error

The fact that the issue still occurs when the dataset on disk is halved and then read again suggested corruption during write, but the fact that every individual Parquet file remains independently readable suggests a problem with modular decryption during Dataset operations. That the error persists with threading disabled was unexpected.

The error is reproducible with any random data, and with a KMS client which does no encryption at all: it simply passes the keys around in (encoded) plaintext, eliminating our custom KMS client as a probable cause. It occurs whether or not the footer is plaintext, and whether or not envelope encryption is used.

Reproduction in Python below (no partitions are needed at all, so it produces a single Parquet file, which can be read normally on its own):

```python3
import base64
import tempfile

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pqe


class NoOpKmsClient(pqe.KmsClient):
    """KMS client that does no real encryption: keys are passed around base64-encoded."""

    def __init__(self):
        super().__init__()

    def wrap_key(self, key_bytes: bytes, _: str) -> bytes:
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key: bytes, _: str) -> bytes:
        return base64.b64decode(wrapped_key)


row_count = pow(2, 15) + 1  # 2^15 rows never fault; 2^15 + 1 usually does
table = pa.Table.from_arrays(
    [pa.array(np.random.rand(row_count), type=pa.float32())], names=["foo"]
)

kms_config = pqe.KmsConnectionConfig()
crypto_factory = pqe.CryptoFactory(lambda _: NoOpKmsClient())
encryption_config = pqe.EncryptionConfiguration(
    footer_key="UNIMPORTANT_KEY",
    column_keys={"UNIMPORTANT_KEY": ["foo"]},
    double_wrapping=True,
    plaintext_footer=False,
    data_key_length_bits=128,
)
pqe_config = ds.ParquetEncryptionConfig(crypto_factory, kms_config, encryption_config)
pqd_config = ds.ParquetDecryptionConfig(crypto_factory, kms_config, pqe.DecryptionConfiguration())

scan_options = ds.ParquetFragmentScanOptions(decryption_config=pqd_config)
file_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)
write_options = file_format.make_write_options(encryption_config=pqe_config)
file_decryption_properties = crypto_factory.file_decryption_properties(kms_config)

with tempfile.TemporaryDirectory() as tempdir:
    path = tempdir + "/test-dataset"
    ds.write_dataset(table, path, format=file_format, file_options=write_options)

    # Reading the single file directly succeeds
    file_path = path + "/part-0.parquet"
    new_table = pq.ParquetFile(file_path, decryption_properties=file_decryption_properties).read()
    assert table == new_table

    # Reading through the Dataset API faults (or raises a decryption error)
    dataset = ds.dataset(path, format=file_format)
    new_table = dataset.to_table()
    assert table == new_table
```
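For completeness, the per-file read that never fails (third observation in the list above) looks roughly like the sketch below. It assumes the flat, unpartitioned layout of the reproduction script and reuses its `path`, `crypto_factory`, and `kms_config`:

```python3
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch only: assumes `path`, `crypto_factory` and `kms_config` are defined
# as in the reproduction script above, and that all files sit directly in
# `path` (no partition subdirectories).
decryption_properties = crypto_factory.file_decryption_properties(kms_config)

pieces = []
for name in sorted(os.listdir(path)):
    if not name.endswith(".parquet"):
        continue
    # Decrypting each file individually never fails...
    piece = pq.ParquetFile(
        os.path.join(path, name), decryption_properties=decryption_properties
    ).read()
    pieces.append(piece)

# ...and neither does concatenating the results; only the Dataset read does.
combined = pa.concat_tables(pieces)
```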
Any help here would be much appreciated: being restricted to `2^15` rows is a roadblock to our use of this feature.

### Component(s)

Parquet, Python
