xy720 opened a new pull request, #24840:
URL: https://github.com/apache/doris/pull/24840
## Proposed changes
In version 1.2.7, we found that a BE core dump occurs about once a month. The backtrace from the core dump is:
```
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fb96ac13d00 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x000055a571705bf6 in __gthread_mutex_lock (__mutex=0x725f74756f656dd1)
at
/var/local/ldb-toolchain/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#2 std::mutex::lock (this=0x725f74756f656dd1) at
/var/local/ldb-toolchain/include/c++/11/bits/std_mutex.h:100
#3 std::unique_lock<std::mutex>::lock (this=0x7fb87c7efc00) at
/var/local/ldb-toolchain/include/c++/11/bits/unique_lock.h:139
#4 std::unique_lock<std::mutex>::unique_lock (__m=..., this=0x7fb87c7efc00)
at /var/local/ldb-toolchain/include/c++/11/bits/unique_lock.h:69
#5 doris::BlockingQueue<doris::vectorized::ScannerContext*>::blocking_put
(val=<synthetic pointer>: <optimized out>,
this=0x725f74756f656d69) at
/data/doris-1.x/be/src/util/blocking_queue.hpp:103
#6 doris::vectorized::ScannerScheduler::submit (this=<optimized out>,
ctx=0x7fb470295e00)
at /data/doris-1.x/be/src/vec/exec/scan/scanner_scheduler.cpp:100
#7 0x000055a5716c9360 in doris::vectorized::VScanNode::_start_scanners
(this=this@entry=0x7fb389e17000, scanners=...)
at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
#8 0x000055a5716d9e53 in doris::vectorized::VScanNode::open
(this=0x7fb389e17000, state=<optimized out>)
at /data/doris-1.x/be/src/vec/exec/scan/vscan_node.cpp:102
#9 0x000055a56c777eef in
doris::PlanFragmentExecutor::open_vectorized_internal
(this=this@entry=0x7fb3e9c42680)
at /var/local/ldb-toolchain/include/c++/11/bits/unique_ptr.h:421
#10 0x000055a56c7795ae in doris::PlanFragmentExecutor::open
(this=this@entry=0x7fb3e9c42680)
at /data/doris-1.x/be/src/runtime/plan_fragment_executor.cpp:261
#11 0x000055a56c751c79 in doris::FragmentExecState::execute
(this=0x7fb3e9c42600)
at /data/doris-1.x/be/src/runtime/fragment_mgr.cpp:260
#12 0x000055a56c7552ce in
doris::FragmentMgr::_exec_actual(std::shared_ptr<doris::FragmentExecState>,
std::function<void (doris::PlanFragmentExecutor*)>)
(this=this@entry=0x7fb8d0e25000, exec_state=..., cb=...)
at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
#13 0x000055a56c755802 in operator() (__closure=<optimized out>) at
/var/local/ldb-toolchain/include/c++/11/ext/atomicity.h:109
#14 std::__invoke_impl<void, doris::FragmentMgr::exec_plan_fragment(const
doris::TExecPlanFragmentParams&,
doris::FragmentMgr::FinishCallback)::<lambda()>&> (__f=...) at
/var/local/ldb-toolchain/include/c++/11/bits/invoke.h:61
#15 std::__invoke_r<void, doris::FragmentMgr::exec_plan_fragment(const
doris::TExecPlanFragmentParams&,
doris::FragmentMgr::FinishCallback)::<lambda()>&> (__fn=...) at
/var/local/ldb-toolchain/include/c++/11/bits/invoke.h:111
#16 std::_Function_handler<void(),
doris::FragmentMgr::exec_plan_fragment(const doris::TExecPlanFragmentParams&,
doris::FragmentMgr::FinishCallback)::<lambda()> >::_M_invoke(const
std::_Any_data &) (__functor=...)
at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:291
#17 0x000055a56ca0d475 in std::function<void ()>::operator()() const
(this=<optimized out>)
at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:556
#18 doris::FunctionRunnable::run (this=<optimized out>) at
/data/doris-1.x/be/src/util/threadpool.cpp:46
#19 doris::ThreadPool::dispatch_thread (this=0x7fb933ceea80) at
/data/doris-1.x/be/src/util/threadpool.cpp:535
#20 0x000055a56ca028cf in std::function<void ()>::operator()() const
(this=0x7faf3395f078)
at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:556
#21 doris::Thread::supervise_thread (arg=0x7faf3395f060) at
/data/doris-1.x/be/src/util/thread.cpp:454
#22 0x00007fb96ac11ea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fb96af249fd in clone () from /lib64/libc.so.6
```
The program hits a segmentation fault at memory address 0x725f74756f656dd1:
```
(gdb) x 0x725f74756f656dd1
0x725f74756f656dd1: Cannot access memory at address 0x725f74756f656dd1
```
This address is not actually a valid mutex address; it is a bogus address caused by an out-of-bounds memory access. Here is the proof:
```
(gdb) f 6
#6 doris::vectorized::ScannerScheduler::submit (this=<optimized out>,
ctx=0x7fb470295e00)
at /data/doris-1.x/be/src/vec/exec/scan/scanner_scheduler.cpp:100
100 if (!_pending_queues[ctx->queue_idx]->blocking_put(ctx)) {
(gdb) p ctx->queue_idx
$2 = -3
(gdb) f 7
#7 0x000055a5716c9360 in doris::vectorized::VScanNode::_start_scanners
(this=this@entry=0x7fb389e17000, scanners=...)
at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
1290 /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h: No
such file or directory.
(gdb) p _state->_exec_env->_scanner_scheduler->_pending_queues[-3]->_lock
Cannot access memory at address 0x725f74756f656dd1
```
The program accesses address 0x725f74756f656dd1 because ctx->queue_idx is set to -3, so `_pending_queues[ctx->queue_idx]` reads memory outside the queue array and treats whatever bytes it finds there as a BlockingQueue pointer.
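For illustration, here is a minimal, self-contained sketch of what that out-of-bounds lookup amounts to. This is hypothetical code, not the actual Doris implementation: a std::vector of void* stands in for the array of BlockingQueue pointers, and the arithmetic only computes the out-of-bounds location instead of reading it.
```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical stand-in for ScannerScheduler::_pending_queues.
    std::vector<void*> pending_queues(4, nullptr);
    std::intptr_t queue_idx = -3;  // the value gdb showed for ctx->queue_idx

    // Indexing with -3 involves no bounds checking: it would read the
    // pointer-sized slot located 3 * sizeof(void*) = 24 bytes *before* the
    // buffer and treat those bytes as a BlockingQueue pointer, whose internal
    // mutex blocking_put() then tries to lock. In the core dump those stray
    // bytes formed an unmapped address, hence the crash in pthread_mutex_lock.
    auto buffer = reinterpret_cast<std::uintptr_t>(pending_queues.data());
    std::uintptr_t bogus_slot =
            buffer + static_cast<std::uintptr_t>(queue_idx) * sizeof(void*);
    std::printf("buffer start:          0x%zx\n", static_cast<std::size_t>(buffer));
    std::printf("slot read for idx -3:  0x%zx\n", static_cast<std::size_t>(bogus_slot));
    return 0;
}
```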
The reason is that the atomic_int _queue_idx in ScannerScheduler is incremented on every submit and eventually overflows to a negative value.
So we should use atomic_uint, whose wraparound is well-defined, to avoid the negative index.
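For context, below is a minimal sketch of the wraparound behaviour, assuming the scheduler assigns queues round-robin roughly like `ctx->queue_idx = _queue_idx++ % QUEUE_NUM` (QUEUE_NUM and the exact expression are assumptions for illustration, not copied from scanner_scheduler.cpp). std::atomic fetch_add on a signed type wraps like two's complement, and since `%` in C++ keeps the sign of the dividend, the index turns negative once the counter passes INT32_MAX; with an unsigned counter the wraparound is well-defined and the index always stays in [0, QUEUE_NUM).
```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

constexpr int QUEUE_NUM = 4;  // hypothetical number of pending queues

int main() {
    // Pre-fix behaviour: a signed counter. Starting just below INT32_MAX so
    // the wrap happens within a few iterations instead of ~2.1 billion.
    std::atomic<int32_t> signed_counter{INT32_MAX - 1};
    for (int i = 0; i < 4; ++i) {
        int32_t idx = signed_counter.fetch_add(1) % QUEUE_NUM;
        std::printf("signed   queue_idx = %d\n", static_cast<int>(idx));
    }
    // Prints 2, 3, 0, -3: once the counter wraps negative, so does the index.

    // Proposed fix: an unsigned counter. Wraparound is well-defined and the
    // modulo result can never be negative.
    std::atomic<uint32_t> unsigned_counter{UINT32_MAX - 1};
    for (int i = 0; i < 4; ++i) {
        uint32_t idx = unsigned_counter.fetch_add(1) % QUEUE_NUM;
        std::printf("unsigned queue_idx = %u\n", static_cast<unsigned>(idx));
    }
    return 0;
}
```
Since the counter only needs to spread scanner contexts round-robin across the pending queues, letting an unsigned value wrap around is harmless.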