pentschev opened a new issue, #39619:
URL: https://github.com/apache/arrow/issues/39619

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We are observing segmentation faults with (Py)Arrow 14.0.1 in a complex 
setup with multiple NVIDIA GPUs, which makes the problem hard to reproduce. 
See the stack trace below:
   
   <details><summary>Stack trace</summary>
   
   ```c++
   #0  arrow::internal::FnOnce<void ()>::operator()() && (this=0x7efba962f430) 
at /usr/src/arrow/cpp/src/arrow/util/functional.h:140
   #1  0x00007efd41953293 in 
arrow::internal::SerialExecutor::RunTasksOnAllExecutors () at 
/usr/src/arrow/cpp/src/arrow/util/thread_pool.cc:317
   #2  0x00007efd418fd31f in arrow::ConcreteFutureImpl::DoWait 
(this=0x7efba400e8d0) at /usr/src/arrow/cpp/src/arrow/util/future.cc:163
   #3  0x00007efd418fa65a in arrow::FutureImpl::Wait (this=0x7efba400e8d0) at 
/usr/src/arrow/cpp/src/arrow/util/future.cc:220
   #4  0x00007efd41833b22 in arrow::Future<std::shared_ptr<arrow::Buffer> 
>::Wait (this=0x7efba962f590) at /usr/src/arrow/cpp/src/arrow/util/future.h:385
   #5  0x00007efd41831f70 in arrow::Future<std::shared_ptr<arrow::Buffer> 
>::result() const & (this=0x7efba962f590) at 
/usr/src/arrow/cpp/src/arrow/util/future.h:356
   #6  0x00007efd4183022c in arrow::io::internal::ReadRangeCache::Impl::Read 
(this=0x7efba40081f0, range=...) at 
/usr/src/arrow/cpp/src/arrow/io/caching.cc:209
   #7  0x00007efd41830ded in 
arrow::io::internal::ReadRangeCache::LazyImpl::Read (this=0x7efba40081f0, 
range=...) at /usr/src/arrow/cpp/src/arrow/io/caching.cc:294
   #8  0x00007efd4182f35d in arrow::io::internal::ReadRangeCache::Read 
(this=0x7efba4010ce0, range=...) at 
/usr/src/arrow/cpp/src/arrow/io/caching.cc:325
   #9  0x00007efc24942d2e in parquet::SerializedRowGroup::GetColumnPageReader 
(this=0x7efba4193480, i=129) at 
/usr/src/arrow/cpp/src/parquet/file_reader.cc:213
   #10 0x00007efc2493e66c in parquet::RowGroupReader::GetColumnPageReader 
(this=0x7efba41cdd20, i=129) at 
/usr/src/arrow/cpp/src/parquet/file_reader.cc:131
   #11 0x00007efc246eaae2 in parquet::arrow::FileColumnIterator::NextChunk 
(this=0x7efba41a5470) at 
/usr/src/arrow/cpp/src/parquet/arrow/reader_internal.h:80
   #12 0x00007efc246d448c in parquet::arrow::(anonymous 
namespace)::LeafReader::NextRowGroup (this=0x7efba63a8e20) at 
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:505
   #13 0x00007efc246d41f2 in parquet::arrow::(anonymous 
namespace)::LeafReader::LoadBatch (this=0x7efba63a8e20, records_to_read=255000) 
at /usr/src/arrow/cpp/src/parquet/arrow/reader.cc:485
   #14 0x00007efc246eac67 in parquet::arrow::ColumnReaderImpl::NextBatch 
(this=0x7efba63a8e20, batch_size=510000, out=0x7efba962fc40) at 
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:109
   #15 0x00007efc246d2a68 in parquet::arrow::(anonymous 
namespace)::FileReaderImpl::ReadColumn (this=0x7efba4003230, i=0, 
row_groups=std::vector of length 2, capacity 2 = {...}, reader=0x7efba63a8e20, 
out=0x7efba962fc40)
       at /usr/src/arrow/cpp/src/parquet/arrow/reader.cc:284
   #16 0x00007efc246d9acd in operator() (__closure=0x7efba962fdd0, i=0, 
reader=std::shared_ptr<parquet::arrow::ColumnReaderImpl> (use count 2, weak 
count 0) = {...}) at /usr/src/arrow/cpp/src/parquet/arrow/reader.cc:1252
   #17 0x00007efc246dd55a in 
arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous
 namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, 
arrow::internal::Executor*)::<lambda(size_t, 
std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, 
std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(bool, 
std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, 
std::allocator<std::shared_ptr<parquet::arrow::ColumnReaderImpl> > >, struct 
{...} &, arrow::internal::Executor *) (use_threads=false, inputs=std::vector of 
length 2, capacity 2 = {...}, func=..., executor=0x5559cb52c120)
       at /usr/src/arrow/cpp/src/arrow/util/parallel.h:95
   #18 0x00007efc246da1fc in parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups (this=0x7efba4003230, 
self=std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl> 
(empty) = {...},
       row_groups=std::vector of length 2, capacity 2 = {...}, 
column_indices=std::vector of length 3, capacity 4 = {...}, 
cpu_executor=0x5559cb52c120) at 
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:1270
   #19 0x00007efc246d9889 in parquet::arrow::(anonymous 
namespace)::FileReaderImpl::ReadRowGroups (this=0x7efba4003230, 
row_groups=std::vector of length 2, capacity 2 = {...}, 
column_indices=std::vector of length 3, capacity 4 = {...},
       out=0x7efba962ff90) at 
/usr/src/arrow/cpp/src/parquet/arrow/reader.cc:1232
   #20 0x00007efd126dab43 in 
__pyx_pw_7pyarrow_8_parquet_13ParquetReader_14read_row_groups(_object*, 
_object* const*, long, _object*) () from 
/usr/local/lib/python3.10/dist-packages/pyarrow/_parquet.cpython-310-x86_64-linux-gnu.so
   #21 0x00005559b14d0ae6 in ?? ()
   #22 0x00005559b14ac53c in _PyEval_EvalFrameDefault ()
   #23 0x00005559b14d07f1 in ?? ()
   #24 0x00005559b14d1492 in PyObject_Call ()
   #25 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #26 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #27 0x00005559b14d1492 in PyObject_Call ()
   #28 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #29 0x00005559b14d07f1 in ?? ()
   #30 0x00005559b14d1492 in PyObject_Call ()
   #31 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #32 0x00005559b14d07f1 in ?? ()
   #33 0x00005559b14d1492 in PyObject_Call ()
   #34 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #35 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #36 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
   #37 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #38 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
   #39 0x00005559b14b7c14 in _PyObject_FastCallDictTstate ()
   #40 0x00005559b14cd86c in _PyObject_Call_Prepend ()
   #41 0x00005559b15e8700 in ?? ()
   #42 0x00005559b14d142b in PyObject_Call ()
   #43 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #44 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #45 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
   #46 0x00005559b14dfcc2 in ?? ()
   #47 0x00005559b148d860 in PySequence_Tuple ()
   #48 0x00005559b14b1bfa in _PyEval_EvalFrameDefault ()
   #49 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #50 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
   #51 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #52 0x00005559b14b0cfa in _PyEval_EvalFrameDefault ()
   #53 0x00005559b14b7c14 in _PyObject_FastCallDictTstate ()
   #54 0x00005559b14cd86c in _PyObject_Call_Prepend ()
   #55 0x00005559b15e8700 in ?? ()
   #56 0x00005559b14d142b in PyObject_Call ()
   #57 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #58 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #59 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
   #60 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #61 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #62 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #63 0x00005559b14ab26d in _PyEval_EvalFrameDefault ()
   #64 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #65 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #66 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #67 0x00005559b14ab45c in _PyEval_EvalFrameDefault ()
   #68 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #69 0x00005559b14ad5d7 in _PyEval_EvalFrameDefault ()
   #70 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #71 0x00005559b14ab45c in _PyEval_EvalFrameDefault ()
   #72 0x00005559b14c29fc in _PyFunction_Vectorcall ()
   #73 0x00005559b14ab45c in _PyEval_EvalFrameDefault ()
   #74 0x00005559b14d0a51 in ?? ()
   #75 0x00005559b15f9f3a in ?? ()
   #76 0x00005559b15eeef8 in ?? ()
   #77 0x00007eff67862ac3 in start_thread (arg=<optimized out>) at 
./nptl/pthread_create.c:442
   #78 0x00007eff678f3a04 in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:100
   ```
   
   </details>
   
   This occurs when we run a batch of tests. From what I can tell, it does not 
always happen in the same test, so the failure appears to be 
non-deterministic.
   
   The code at the top of the stack trace seems to have been introduced via 
https://github.com/apache/arrow/pull/35672 and first released in Arrow 14.x. Our 
previous software stack was running Arrow 12.x, so the problem was not 
observable there. Additionally, the segfault seems to occur only if Arrow is 
built with `ARROW_ENABLE_THREADING=OFF`. I'm not sure yet whether we build the 
Arrow C++ libraries ourselves or install official Arrow binaries (still 
inquiring internally), so this may or may not affect users who install 
official builds.
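
To help tell an official binary from a custom build, the build metadata that pyarrow exposes can be printed. This is a minimal sketch, assuming `pyarrow.cpp_build_info` is available in the installed version. The helper name `describe_arrow_build` is hypothetical, and as far as I know this metadata does not report `ARROW_ENABLE_THREADING` directly, so it only hints at where the binaries came from:

```python
import importlib


def describe_arrow_build():
    """Collect the build metadata pyarrow exposes, if pyarrow is installed.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    try:
        pa = importlib.import_module("pyarrow")
    except ImportError:
        return {"installed": False}

    info = {"installed": True, "version": pa.__version__}
    # cpp_build_info describes the Arrow C++ library that pyarrow links to.
    # Fields are read with getattr so a missing field yields None instead
    # of raising on versions that do not provide it.
    build = getattr(pa, "cpp_build_info", None)
    for field in ("version", "git_id", "git_description",
                  "package_kind", "build_type"):
        info["cpp_" + field] = getattr(build, field, None)
    return info


if __name__ == "__main__":
    for key, value in describe_arrow_build().items():
        print(f"{key}: {value}")
```

An empty or unexpected `cpp_package_kind` would suggest a locally built library, but that is only a hint and not a confirmation of the threading setting.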
   
   Finally, I have not yet been able to upgrade to Arrow 14.0.2 to check 
whether the error persists, as I need more information from our internal build 
team. I couldn't find any open or closed issues mentioning the same 
segmentation fault, so I believe it is still not fixed, but please let me know 
if I missed something.
   
   The code is fairly complex and would take me quite some time to debug on my 
own, so I'm hoping someone knowledgeable about it can help narrow this down. 
I'm happy to provide further information, but at the moment I'm still unable 
to provide a minimal reproducer.
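
As a possible starting point for a reproducer, the frames in the stack trace suggest a worker thread calling `read_row_groups` with `use_threads=False` while reads go through the read-range cache. The sketch below is a guess based only on those frames and has not been confirmed to trigger the segfault. The function name `stress_read_row_groups`, the table contents, and the thread and iteration counts are all hypothetical, and `pre_buffer=True` is an assumption about how the cache is being enabled in our setup:

```python
import os
import tempfile
import threading


def stress_read_row_groups(n_threads=4, n_iters=20):
    """Read one Parquet file from several Python threads, use_threads=False.

    Returns the number of completed reads, or None if pyarrow is missing.
    Hypothetical stress test; not a confirmed reproducer.
    """
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        return None

    completed = []
    lock = threading.Lock()

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "stress.parquet")
        table = pa.table({
            "a": list(range(1000)),
            "b": [float(i) for i in range(1000)],
            "c": [str(i) for i in range(1000)],
        })
        # row_group_size=500 splits the 1000 rows into two row groups.
        pq.write_table(table, path, row_group_size=500)

        def worker():
            for _ in range(n_iters):
                # pre_buffer=True is assumed to route reads through the
                # read-range cache seen in the stack trace.
                reader = pq.ParquetFile(path, pre_buffer=True)
                out = reader.read_row_groups(
                    [0, 1], columns=["a", "b", "c"], use_threads=False
                )
                with lock:
                    completed.append(out.num_rows)

        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    return len(completed)


if __name__ == "__main__":
    print(stress_read_row_groups())
```

On a build where the problem exists, the expectation would be an intermittent crash rather than a Python exception, so running this in a loop under `gdb` may be more informative than a single run.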
   
   ### Component(s)
   
   C++


