[
https://issues.apache.org/jira/browse/IMPALA-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912978#comment-17912978
]
Laszlo Gaal commented on IMPALA-13669:
--------------------------------------
Managed to catch an instance of the hang while the build machine was still
online. Dumping the stack of the hung process yielded the following call stack:
{code}
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffffb8c4c040 (LWP 894818) 0x0000ffffb7aeecc4 in
__futex_abstimed_wait64 () from /lib64/libc.so.6
(gdb) bt
#0 0x0000ffffb7aeecc4 in __futex_abstimed_wait64 () from /lib64/libc.so.6
#1 0x0000ffffb7af8f80 in pthread_rwlock_wrlock@GLIBC_2.17 () from
/lib64/libc.so.6
#2 0x0000000003632440 in glog_internal_namespace_::Mutex::Lock (this=0x556f430
<google::log_mutex>) at src/base/mutex.h:250
#3 glog_internal_namespace_::MutexLock::MutexLock (mu=0x556f430
<google::log_mutex>, this=<synthetic pointer>) at src/base/mutex.h:290
#4 google::LogMessage::Flush (this=0xffffc5250608) at src/logging.cc:1335
#5 0x0000000003634c30 in google::LogMessageFatal::~LogMessageFatal
(this=<optimized out>, __in_chrg=<optimized out>) at src/logging.cc:2048
#6 0x0000000000f849e4 in impala::BufferPool::Client::MoveToDirtyUnpinned
(this=<optimized out>, page=page@entry=0x2da57e00) at
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool.cc:503
#7 0x0000000000f84cfc in impala::BufferPool::Unpin
(this=this@entry=0x4177c200, client=<optimized out>,
client@entry=0xffffc52507a8, handle=0x2e60cc80) at
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool.cc:210
#8 0x0000000000f50048 in
impala::BufferPoolTest_ScratchLimitZero_Test::TestBody (this=0x2dad3000) at
/data0/jenkins/workspace/impala-private-basic-parameterized/Impala-Toolchain/toolchain-packages-gcc10.4.0/gcc-10.4.0/include/c++/10.4.0/bits/stl_vector.h:1168
#9 0x00000000037e77f8 in
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>
(location=0x3f4f198 "the test body", method=<optimized out>, object=0x2dad3000)
at /mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2612
#10 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>
(object=object@entry=0x2dad3000, method=<optimized out>,
location=location@entry=0x3f4f198 "the test body") at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2648
#11 0x00000000037cc468 in testing::Test::Run (this=0x2dad3000) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2687
#12 testing::Test::Run (this=0x2dad3000) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2677
#13 0x00000000037cc608 in testing::TestInfo::Run (this=0x27c32c60) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2836
#14 0x00000000037cc8a4 in testing::TestSuite::Run (this=0x27c3e240) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:3015
#15 testing::TestSuite::Run (this=0x27c3e240) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2968
#16 0x00000000037df65c in testing::internal::UnitTestImpl::RunAllTests
(this=this@entry=0x27c2e000) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:5920
#17 0x00000000037cc9d8 in
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl,
bool> (location=0x3f4f248 "auxiliary test code (environments or event
listeners)", method=<optimized out>, object=0x27c2e000) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2601
#18
testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl,
bool> (location=0x3f4f248 "auxiliary test code (environments or event
listeners)", method=<optimized out>, object=0x27c2e000) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2648
#19 testing::UnitTest::Run (this=0x575ad50
<testing::UnitTest::GetInstance()::instance>) at
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:5484
#20 0x0000000000f34be4 in RUN_ALL_TESTS () at
/data/jenkins/workspace/impala-private-basic-parameterized/Impala-Toolchain/toolchain-packages-gcc10.4.0/googletest-1.14.0/include/gtest/gtest.h:2317
#21 main (argc=<optimized out>, argv=<optimized out>) at
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool-test.cc:2521
{code}
(This was captured on ARM)
Listing the call environment for frame #6:
{code}
(gdb) f 6
#6 0x0000000000f849e4 in impala::BufferPool::Client::MoveToDirtyUnpinned
(this=<optimized out>, page=page@entry=0x2da57e00) at
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool.cc:503
503 DCHECK(spilling_enabled());
(gdb) list
498 handle->Reset();
499 }
500
501 void BufferPool::Client::MoveToDirtyUnpinned(Page* page) {
502 // Only valid to unpin pages if spilling is enabled.
503 DCHECK(spilling_enabled());
504 DCHECK_EQ(0, page->pin_count);
505
506 unique_lock<mutex> lock(lock_);
507 DCHECK_CONSISTENCY();
{code}
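This suggests that the hang might be happening in glog: the DCHECK reported at
buffer-pool.cc:503 appears to have fired (frame #5 is the LogMessageFatal
destructor), and the thread is then blocked in LogMessage::Flush() waiting to
acquire google::log_mutex (frames #1 to #4).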
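For reference, frames #1 and #2 above show a write-lock request on a pthread
rwlock that is not being granted. The sketch below only illustrates that
primitive with plain pthreads: it is not glog or Impala code, the helper name
TryWriteLockWhileHeld is made up for the example, and it says nothing about
who holds google::log_mutex in the hung process. It uses the non-blocking
pthread_rwlock_trywrlock() so that the example returns instead of hanging.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>

// Hypothetical helper (not glog or Impala code): takes a write lock on a
// pthread rwlock, then requests it again with the non-blocking variant.
// Returns the error code of the second request.
int TryWriteLockWhileHeld() {
  pthread_rwlock_t lock;
  int rc = pthread_rwlock_init(&lock, nullptr);
  assert(rc == 0);

  rc = pthread_rwlock_wrlock(&lock);  // The first writer gets the lock.
  assert(rc == 0);

  // The lock is already held, so the non-blocking request is refused.
  // A blocking pthread_rwlock_wrlock() from another thread would instead
  // wait here until the holder calls pthread_rwlock_unlock().
  int second = pthread_rwlock_trywrlock(&lock);

  pthread_rwlock_unlock(&lock);
  pthread_rwlock_destroy(&lock);
  return second;
}
```

Calling TryWriteLockWhileHeld() is expected to return EBUSY. If the holder
never releases the lock, a blocking request waits indefinitely, which would
match a process that only goes away when the watchdog kills it.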
> buffer-pool-test hangs on Rocky 9
> ---------------------------------
>
> Key: IMPALA-13669
> URL: https://issues.apache.org/jira/browse/IMPALA-13669
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.5.0
> Reporter: Laszlo Gaal
> Assignee: Laszlo Gaal
> Priority: Critical
>
> Recent test runs on Rocky Linux 9.2 often resulted in a hang in
> {{buffer-pool-test}} during BE tests. The hangs were observed only on Rocky
> 9, and they were seen on both Intel and ARM CPUs.
> When the hang occurs, it is only resolved by the test run's internal watchdog
> timing out at 20 hours, killing the build.
> Example runs:
> * https://jenkins.impala.io/job/rocky-9.2-from-scratch-ARM/4/ (ARM)
> * https://jenkins.impala.io/job/rocky-9.2-from-scratch/9/ (Intel)
> Multiple occurrences were observed in private environments as well.
> Marking as P2 (critical), as it doesn't block precommit runs, but makes it
> impossible to make progress with Rocky 9 / RHEL 9 support.