adamreeve commented on issue #39444:
URL: https://github.com/apache/arrow/issues/39444#issuecomment-1891107716

   I believe the reason this issue is only seen with more than 2^15 rows is that 2^15 
is the value of `kMaxBatchSize` used in Acero, so when a dataset with more rows than 
this is written, the Parquet file is split into multiple row groups. As a workaround, 
I can get the reproduction code from the issue description to run without error by 
using `ds.write_dataset(table, path, format=file_format, file_options=write_options, 
min_rows_per_group=row_count)` (sketched below), so that each Parquet file only ever 
has one row group.
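   
   For reference, a minimal sketch of that workaround. The variable names follow the 
reproduction in the issue description, but the encryption setup used there is omitted, 
so `write_options` here is just the plain Parquet defaults:
   
```python
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds

# Stand-in table; the reproduction in the issue uses an encrypted dataset with
# just over 2**15 rows, which is what triggers the split into multiple row groups.
row_count = 2**15 + 1
table = pa.table({"x": list(range(row_count))})

file_format = ds.ParquetFileFormat()
# In the actual reproduction these write options carry the encryption
# configuration; that setup is omitted here.
write_options = file_format.make_write_options()

path = tempfile.mkdtemp()
ds.write_dataset(
    table,
    path,
    format=file_format,
    file_options=write_options,
    # Force all rows into a single row group so concurrent row-group reads
    # (and hence concurrent decryptor use) cannot happen when scanning.
    min_rows_per_group=row_count,
)
```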
   
   The reason this leads to crashes when multi-threading is enabled seems to be 
that when scanning the dataset, `RowGroupGenerator::read_one_row_group` is 
called concurrently from different threads because a `ReadaheadGenerator` is used, 
leading to concurrent use of the same `AesDecryptor` instances, which are not 
thread-safe. Just putting mutexes around uses of the `AesDecryptor`s isn't 
sufficient to fix the problem, because `Decryptor::UpdateAad` updates the 
AAD value as data pages are read, which then affects use of the same 
decryptor from other threads. The `InternalFileDecryptor::Get*Decryptor` 
methods are also called concurrently but are not thread-safe, because they modify 
`std::map`s and the `all_decryptors_` vector.
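   
   To illustrate why per-call locking isn't enough, here is a toy Python model (not 
the real parquet-cpp classes) of a shared decryptor whose AAD is mutable state, 
analogous to `Decryptor::UpdateAad`: each individual call is protected by a lock, 
but another thread can change the AAD between a reader's `update_aad` and `decrypt` 
calls.
   
```python
import sys
import threading

sys.setswitchinterval(1e-6)  # make thread switches frequent so the bad interleaving is likely


class ToyDecryptor:
    """Toy stand-in for a decryptor with mutable AAD state (not the real class)."""

    def __init__(self):
        self._aad = None
        self._lock = threading.Lock()

    def update_aad(self, aad):
        # Each call is individually serialized by the lock...
        with self._lock:
            self._aad = aad

    def decrypt(self, expected_aad):
        # ...but the AAD seen here may have been overwritten by another thread
        # in between update_aad() and decrypt(). In the real code a wrong AAD
        # makes GCM authentication fail; here we just report the mismatch.
        with self._lock:
            return self._aad == expected_aad


decryptor = ToyDecryptor()  # shared between the "row group readers"
wrong_aad_pages = []


def read_row_group(row_group):
    for page in range(100_000):
        aad = f"rg{row_group}/page{page}"
        decryptor.update_aad(aad)
        if not decryptor.decrypt(aad):
            wrong_aad_pages.append(aad)


threads = [threading.Thread(target=read_row_group, args=(rg,)) for rg in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"pages that would be decrypted with the wrong AAD: {len(wrong_aad_pages)}")
```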
   
   @eirki mentioned seeing this issue without using the Dataset API, and I 
believe this might also happen when `FileReaderImpl::DecodeRowGroups` is used 
to decode multiple row groups concurrently when threading is enabled, but I 
haven't tried reproducing that.

