Great analysis, Weston!
It looks like SimpleRecordBatch::column() is not thread-safe for gcc < 5.0,
since we simulate shared_ptr atomic load/store with plain load/store there:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.cc#L80-L87
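The lazy boxing there looks roughly like this (simplified sketch, not the
exact Arrow source):

std::shared_ptr<Array> SimpleRecordBatch::column(int i) const {
    // Fast path: reuse the boxed column if an earlier call cached it.
    std::shared_ptr<Array> result = std::atomic_load(&boxed_columns_[i]);
    if (!result) {
        // Slow path: box the ArrayData and publish it for other threads.
        result = MakeArray(columns_[i]);
        std::atomic_store(&boxed_columns_[i], result);
    }
    return result;
}

With gcc < 5.0 the atomic_load/atomic_store fall back to plain loads and
stores, so two threads calling column() concurrently can race on
boxed_columns_[i] and corrupt the shared_ptr control block.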
On 5/21/21 8:15 AM, Weston Pace wrote:
I like Yibo's stack-overflow theory given the "error reading variable",
but I did confirm that I can cause a segmentation fault if
std::atomic_store / std::atomic_load are unavailable. I simulated this
by simply commenting out the specializations rather than actually
running against GCC 4.9.2, so it may not be perfect. I've attached a
patch with my stress test (based on the latest master, commit
c697a41ab9c11511113e7387fe4710df920c36ed). Running that stress test
alongside `stress -c 16` on my server reproduces it pretty reliably.
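For anyone who doesn't want to apply the patch, the gist of the test is
just many threads racing to box the same batch's columns. A rough sketch
of its shape (hypothetical, not the attached patch):

#include <memory>
#include <thread>
#include <vector>
#include "arrow/api.h"

// Spawn many threads that all race to box the columns of the same batch.
void StressBatchColumnBoxing(const std::shared_ptr<arrow::RecordBatch>& batch) {
    std::vector<std::thread> threads;
    for (int t = 0; t < 16; ++t) {
        threads.emplace_back([&batch] {
            for (int i = 0; i < batch->num_columns(); ++i) {
                // Without real atomic specializations this races on the
                // lazily boxed column and can corrupt its refcount.
                std::shared_ptr<arrow::Array> col = batch->column(i);
            }
        });
    }
    for (auto& t : threads) t.join();
}

A real test would want to repeat this with a freshly created RecordBatch
each iteration, since the race only exists while a column is still
unboxed.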
Thread 1 (Thread 0x7f6ae05fc700 (LWP 2308757)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f6ae352e859 in __GI_abort () at abort.c:79
#2  0x00007f6ae37fe892 in __gnu_cxx::__verbose_terminate_handler () at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007f6ae37fcf69 in __cxxabiv1::__terminate (handler=<optimized out>) at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007f6ae37fcfab in std::terminate () at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007f6ae37fd9d0 in __cxxabiv1::__cxa_pure_virtual () at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
#6  0x000055a64bc4400a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f6ad0001160) at /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:155
#7  0x000055a64bc420f3 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f6ae05fa568, __in_chrg=<optimized out>) at /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:730
#8  0x000055a64bc3a4a2 in std::__shared_ptr<arrow::Array, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f6ae05fa560, __in_chrg=<optimized out>) at /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1169
#9  0x000055a64bc3a4be in std::shared_ptr<arrow::Array>::~shared_ptr (this=0x7f6ae05fa560, __in_chrg=<optimized out>) at /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr.h:103
#10 0x000055a64bc557ca in arrow::TestRecordBatch_BatchColumnBoxingStress_Test::<lambda()>::operator()(void) const (__closure=0x55a64d5f5218) at ../src/arrow/record_batch_test.cc:206
As a workaround, to see if this is indeed your issue, you can call
RecordBatch::column on each of the columns as soon as you create the
RecordBatch (from a single thread), which will force the boxed columns
to materialize.
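Something like this (untested sketch):

// Eagerly box every column from one thread so later concurrent calls
// to column() only read the already-cached Array.
void MaterializeColumns(const std::shared_ptr<arrow::RecordBatch>& batch) {
    for (int i = 0; i < batch->num_columns(); ++i) {
        batch->column(i);  // first call creates and caches the boxed column
    }
}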
-Weston
On Thu, May 20, 2021 at 11:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
Also, is it possible that the field is not an Int64Array?
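static_pointer_cast won't catch a mismatch at runtime, so it might be
worth a guard along these lines (hypothetical sketch against getCoord's
members):

std::shared_ptr<arrow::Array> col = _arrowBatch->column(_nAtts + dim);
if (col->type_id() != arrow::Type::INT64) {
    // Handle the unexpected type instead of silently mis-reading the buffer.
}
auto int64_col = std::static_pointer_cast<arrow::Int64Array>(col);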
On Wed, May 19, 2021 at 10:19 PM Yibo Cai <yibo....@arm.com> wrote:
On 5/20/21 4:15 AM, Rares Vernica wrote:
Hello,
I'm using Arrow to access data outside the SciDB database engine. It
generally works fine, but we are running into segmentation faults in a
multi-threaded corner case. I identified two threads that work on the
same RecordBatch. I wonder if there is something internal to
RecordBatch that might help solve the mystery.
We are using Arrow 0.16.0. The backtrace of the triggering thread looks
like this:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fdad5fb4700 (LWP 3748)]
0x00007fdaa805abe0 in ?? ()
(gdb) thread
[Current thread is 2 (Thread 0x7fdad5fb4700 (LWP 3748))]
(gdb) bt
#0  0x00007fdaa805abe0 in ?? ()
#1  0x0000000000850212 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#2  0x00007fdae4b1fbf1 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fdad5fb1ae8, __in_chrg=<optimized out>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
#3  0x00007fdae4b39d74 in std::__shared_ptr<arrow::Array, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fdad5fb1ae0, __in_chrg=<optimized out>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
#4  0x00007fdae4b39da8 in std::shared_ptr<arrow::Array>::~shared_ptr (this=0x7fdad5fb1ae0, __in_chrg=<optimized out>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
#5  0x00007fdae4b6a8e1 in scidb::XChunkIterator::getCoord (this=0x7fdaa807f9f0, dim=1, index=1137) at XArray.cpp:358
#6  0x00007fdae4b68ecb in scidb::XChunkIterator::XChunkIterator (this=0x7fdaa807f9f0, chunk=..., iterationMode=0, arrowBatch=<error reading variable: Cannot access memory at address 0xd5fb1b90>) at XArray.cpp:157
FWIW, this "error reading variable" looks suspicious. Maybe the
argument 'arrowBatch' was accidentally trashed (stack overflow)?
https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L132
...
The backtrace of the other thread working on exactly the same
RecordBatch looks like this:
(gdb) thread
[Current thread is 3 (Thread 0x7fdad61b5700 (LWP 3746))]
(gdb) bt
#0  0x00007fdae3bc1ec7 in arrow::SimpleRecordBatch::column(int) const () from /lib64/libarrow.so.16
#1  0x00007fdae4b6a888 in scidb::XChunkIterator::getCoord (this=0x7fdab00c0bb0, dim=0, index=71) at XArray.cpp:357
#2  0x00007fdae4b6a5a2 in scidb::XChunkIterator::operator++ (this=0x7fdab00c0bb0) at XArray.cpp:305
...
In both cases, the last non-Arrow frame is the getCoord function:
https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L355
int64_t XChunkIterator::getCoord(size_t dim, int64_t index)
{
    return std::static_pointer_cast<arrow::Int64Array>(
        _arrowBatch->column(_nAtts + dim))->raw_values()[index];
}
...
std::shared_ptr<const arrow::RecordBatch> _arrowBatch;
Do you see anything suspicious about this code? What would trigger the
shared_ptr destruction which takes place in thread 2?
Thank you!
Rares