pentschev opened a new issue, #39619:
URL: https://github.com/apache/arrow/issues/39619
### Describe the bug, including details regarding any error messages, version, and platform.
We are observing segmentation faults with (Py)Arrow 14.0.1 in a complex setup
with multiple NVIDIA GPUs, which makes the problem hard to reproduce. See the
stack trace below:
<details><summary>Stack trace</summary>
```c++
#0 arrow::internal::FnOnce<void ()>::operator()() && (this=0x7efba962f430)
at /usr/src/arrow/cpp/src/arrow/util/functional.h:140
#1 0x00007efd41953293 in
arrow::internal::SerialExecutor::RunTasksOnAllExecutors () at
/usr/src/arrow/cpp/src/arrow/util/thread_pool.cc:317
#2 0x00007efd418fd31f in arrow::ConcreteFutureImpl::DoWait
(this=0x7efba400e8d0) at /usr/src/arrow/cpp/src/arrow/util/future.cc:163
#3 0x00007efd418fa65a in arrow::FutureImpl::Wait (this=0x7efba400e8d0) at
/usr/src/arrow/cpp/src/arrow/util/future.cc:220
#4 0x00007efd41833b22 in arrow::Future<std::shared_ptr<arrow::Buffer>
>::Wait (this=0x7efba962f590) at /usr/src/arrow/cpp/src/arrow/util/future.h:385
#5 0x00007efd41831f70 in arrow::Future<std::shared_ptr<arrow::Buffer>
>::result() const & (this=0x7efba962f590) at
/usr/src/arrow/cpp/src/arrow/util/future.h:356
#6 0x00007efd4183022c in arrow::io::internal::ReadRangeCache::Impl::Read
(this=0x7efba40081f0, range=...) at
/usr/src/arrow/cpp/src/arrow/io/caching.cc:209
#7 0x00007efd41830ded in
arrow::io::internal::ReadRangeCache::LazyImpl::Read (this=0x7efba40081f0,
range=...) at /usr/src/arrow/cpp/src/arrow/io/caching.cc:294
#8 0x00007efd4182f35d in arrow::io::internal::ReadRangeCache::Read
(this=0x7efba4010ce0, range=...) at
/usr/src/arrow/cpp/src/arrow/io/caching.cc:325
#9 0x00007efc24942d2e in parquet::SerializedRowGroup::GetColumnPageReader
(this=0x7efba4193480, i=129) at
/usr/src/arrow/cpp/src/parquet/file_reader.cc:213
#10 0x00007efc2493e66c in parquet::RowGroupReader::GetColumnPageReader
(this=0x7efba41cdd20, i=129) at
/usr/src/arrow/cpp/src/parquet/file_reader.cc:131
#11 0x00007efc246eaae2 in parquet::arrow::FileColumnIterator::NextChunk
(this=0x7efba41a5470) at
/usr/src/arrow/cpp/src/parquet/arrow/reader_internal.h:80
#12 0x00007efc246d448c in parquet::arrow::(anonymous
namespace)::LeafReader::NextRowGroup (this=0x7efba63a8e20) at
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:505
#13 0x00007efc246d41f2 in parquet::arrow::(anonymous
namespace)::LeafReader::LoadBatch (this=0x7efba63a8e20, records_to_read=255000)
at /usr/src/arrow/cpp/src/parquet/arrow/reader.cc:485
#14 0x00007efc246eac67 in parquet::arrow::ColumnReaderImpl::NextBatch
(this=0x7efba63a8e20, batch_size=510000, out=0x7efba962fc40) at
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:109
#15 0x00007efc246d2a68 in parquet::arrow::(anonymous
namespace)::FileReaderImpl::ReadColumn (this=0x7efba4003230, i=0,
row_groups=std::vector of length 2, capacity 2 = {...}, reader=0x7efba63a8e20,
out=0x7efba962fc40)
at /usr/src/arrow/cpp/src/parquet/arrow/reader.cc:284
#16 0x00007efc246d9acd in operator() (__closure=0x7efba962fdd0, i=0,
reader=std::shared_ptr<parquet::arrow::ColumnReaderImpl> (use count 2, weak
count 0) = {...}) at /usr/src/arrow/cpp/src/parquet/arrow/reader.cc:1252
#17 0x00007efc246dd55a in
arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous
namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous
namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&,
arrow::internal::Executor*)::<lambda(size_t,
std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&,
std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(bool,
std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>,
std::allocator<std::shared_ptr<parquet::arrow::ColumnReaderImpl> > >, struct
{...} &, arrow::internal::Executor *) (use_threads=false, inputs=std::vector of
length 2, capacity 2 = {...}, func=..., executor=0x5559cb52c120)
at /usr/src/arrow/cpp/src/arrow/util/parallel.h:95
#18 0x00007efc246da1fc in parquet::arrow::(anonymous
namespace)::FileReaderImpl::DecodeRowGroups (this=0x7efba4003230,
self=std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>
(empty) = {...},
row_groups=std::vector of length 2, capacity 2 = {...},
column_indices=std::vector of length 3, capacity 4 = {...},
cpu_executor=0x5559cb52c120) at
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:1270
#19 0x00007efc246d9889 in parquet::arrow::(anonymous
namespace)::FileReaderImpl::ReadRowGroups (this=0x7efba4003230,
row_groups=std::vector of length 2, capacity 2 = {...},
column_indices=std::vector of length 3, capacity 4 = {...},
out=0x7efba962ff90) at
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:1232
#20 0x00007efd126dab43 in
__pyx_pw_7pyarrow_8_parquet_13ParquetReader_14read_row_groups(_object*,
_object* const*, long, _object*) () from
/usr/local/lib/python3.10/dist-packages/pyarrow/_parquet.cpython-310-x86_64-linux-gnu.so
#21 0x00005559b14d0ae6 in ?? ()
#22 0x00005559b14ac53c in _PyEval_EvalFrameDefault ()
#23 0x00005559b14d07f1 in ?? ()
#24 0x00005559b14d1492 in PyObject_Call ()
#25 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#26 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#27 0x00005559b14d1492 in PyObject_Call ()
#28 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#29 0x00005559b14d07f1 in ?? ()
#30 0x00005559b14d1492 in PyObject_Call ()
#31 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#32 0x00005559b14d07f1 in ?? ()
#33 0x00005559b14d1492 in PyObject_Call ()
#34 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#35 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#36 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
#37 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#38 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
#39 0x00005559b14b7c14 in _PyObject_FastCallDictTstate ()
#40 0x00005559b14cd86c in _PyObject_Call_Prepend ()
#41 0x00005559b15e8700 in ?? ()
#42 0x00005559b14d142b in PyObject_Call ()
#43 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#44 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#45 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
#46 0x00005559b14dfcc2 in ?? ()
#47 0x00005559b148d860 in PySequence_Tuple ()
#48 0x00005559b14b1bfa in _PyEval_EvalFrameDefault ()
#49 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#50 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
#51 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#52 0x00005559b14b0cfa in _PyEval_EvalFrameDefault ()
#53 0x00005559b14b7c14 in _PyObject_FastCallDictTstate ()
#54 0x00005559b14cd86c in _PyObject_Call_Prepend ()
#55 0x00005559b15e8700 in ?? ()
#56 0x00005559b14d142b in PyObject_Call ()
#57 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#58 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#59 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
#60 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#61 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#62 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#63 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
#64 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#65 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#66 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#67 0x00005559b14ab45c in _PyEval_EvalFrameDefault ()
#68 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#69 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
#70 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#71 0x00005559b14ab45c in _PyEval_EvalFrameDefault ()
#72 0x00005559b14c29fc in _PyFunction_Vectorcall ()
#73 0x00005559b14ab45c in _PyEval_EvalFrameDefault ()
#74 0x00005559b14d0a51 in ?? ()
#75 0x00005559b15f9f3a in ?? ()
#76 0x00005559b15eeef8 in ?? ()
#77 0x00007eff67862ac3 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:442
#78 0x00007eff678f3a04 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:100
```
</details>
This occurs when we run a batch of tests, and from what I can tell it
doesn't always happen in the same test, so the failure is non-deterministic.
The code path in the stack above seems to have been introduced via
https://github.com/apache/arrow/pull/35672 and first released in Arrow 14.x. Our
previous software stack ran Arrow 12.x, so the problem wasn't observable there.
Additionally, the segfault seems to occur only if Arrow is built with
`ARROW_ENABLE_THREADING=OFF`. I'm not sure yet whether we're building the Arrow
C++ libraries ourselves or installing official Arrow binaries (still inquiring
internally). If the error is only observable on binaries built with
`ARROW_ENABLE_THREADING=OFF`, it may or may not affect users who install
official builds.
Finally, I can't yet upgrade to Arrow 14.0.2 to check whether the error
persists, because I need more information from our internal build team. I
couldn't find any open or closed issues mentioning the same segmentation fault,
so I believe it's still not fixed, but please let me know if I missed something.
The code is a bit complex and would take me quite some time to debug on my
own, so I'm hoping someone knowledgeable about this area can help narrow it
down. I'm happy to provide further information, but at this moment I'm still
unable to provide a minimal reproducer.
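As a starting point for narrowing this down, below is a rough sketch of a stress harness. It is not a confirmed reproducer. It assumes the crashing read goes through `pyarrow.parquet.ParquetFile.read_row_groups` with `pre_buffer=True` and `use_threads=False` (inferred from frames #6-#8, #17 and #20 of the trace) and that it is called from a non-main Python thread (frames #77-#78). The helper names `read_row_groups_once` and `stress`, the file path, and the iteration and worker counts are all hypothetical. `faulthandler` is enabled so that a segfault also dumps the Python-level stack, which the gdb trace shows only as `?? ()`.

```python
import faulthandler
from concurrent.futures import ThreadPoolExecutor


def read_row_groups_once(path, columns=None):
    """One read resembling the call in the trace (parameters are assumptions)."""
    # Imported lazily so the harness itself does not hard-depend on pyarrow.
    import pyarrow.parquet as pq

    # pre_buffer=True is assumed to be what routes reads through ReadRangeCache;
    # use_threads=False matches use_threads=false in frame #17.
    pf = pq.ParquetFile(path, pre_buffer=True)
    groups = list(range(min(2, pf.num_row_groups)))
    return pf.read_row_groups(groups, columns=columns, use_threads=False)


def stress(read_once, iterations=100, workers=4):
    """Call read_once repeatedly from several Python threads; return the call count."""
    # On SIGSEGV, dump the Python stack of every thread to stderr.
    faulthandler.enable(all_threads=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_once) for _ in range(iterations)]
        for future in futures:
            # result() re-raises any Python-level exception from a worker.
            future.result()
    return len(futures)
```

It could be driven with something like `stress(lambda: read_row_groups_once("data.parquet"))`, where `data.parquet` stands in for any file with at least two row groups. Whether this triggers the segfault on a build with `ARROW_ENABLE_THREADING=OFF` is unverified.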
### Component(s)
C++