[ https://issues.apache.org/jira/browse/KUDU-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15666133#comment-15666133 ]
Todd Lipcon commented on KUDU-1745:
-----------------------------------

I actually just figured out the cluster based on the IP in the error message :) Here's one thing I find interesting in some of the impalad logs:

{code}
W1114 18:38:23.978837  9768 kernel_stack_watchdog.cc:144] Thread 24970 stuck at /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-centos-6/toolchain/source/kudu/kudu-88b02349d88d335caac18bf8b930eac6d327ed40/src/kudu/rpc/outbound_call.cc:185 for 101086ms:
Kernel stack:
[<ffffffff810b23ba>] futex_wait_queue_me+0xba/0xf0
[<ffffffff810b3510>] futex_wait+0x1c0/0x310
[<ffffffff810b4e21>] do_futex+0x121/0xae0
[<ffffffff810b585b>] sys_futex+0x7b/0x170
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
{code}

This message appears when an RPC callback takes a long time - here it seems to have been blocked for 101 seconds. I also see other messages preceding it, like:

{code}
W1114 18:38:15.426391 25528 outbound_call.cc:199] RPC callback for RPC call kudu.tserver.TabletServerService.Write -> {remote=172.28.195.84:7050, user_credentials={real_user=impala}} blocked reactor thread for 1.09099e+08us
W1114 18:38:15.426630 25528 connection.cc:205] RPC call timeout handler was delayed by 67.0671s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 17.9996s for a response.
{code}

So something is blocking our threads for a really long time. In the context of Impala it's a little tricky to figure out, because we need SIGUSR2 in order to fill in the user-space stack. Would it be possible to try this workload again after adding a call to kudu::SetStackTraceSignal(...) with some signal that you don't use in Impala? E.g. maybe SIGRTMIN+1 or some such.
I'm not 100% sure it's safe to use unless you compile your whole toolchain with -fno-omit-frame-pointer, but it should be safe enough to turn on for now to help diagnose what's going on here. My main theory is that something is blocking the log disk for minutes on end, and that's causing a bunch of timeouts because we're trying to log stuff from the write callback.

All of the above notwithstanding, we shouldn't crash :) Obviously there's a bug here as well, but it would be good to understand why this stress scenario is grinding our reactor threads to a halt.

> Kudu causes Impala to crash under stress
> ----------------------------------------
>
> Key: KUDU-1745
> URL: https://issues.apache.org/jira/browse/KUDU-1745
> Project: Kudu
> Issue Type: Bug
> Reporter: Taras Bobrovytsky
> Priority: Critical
> Attachments: hs_err_pid7761.log, hs_err_pid9275.log, stacks.out
>
> There were over 200 queries running, about half of which were selects and the rest were upsert and delete queries.
> There was a crash after a few minutes with the following stack trace:
> {code}
> Stack: [0x00007f1629c93000,0x00007f162a694000], sp=0x00007f162a6922b0, free space=10236k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C [libstdc++.so.6+0xc5018] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)+0x8
> C [libkudu_client.so.0+0x6f662] _ZNSt6vectorIPN4kudu6client9KuduErrorESaIS3_EE19_M_emplace_back_auxIIS3_EEEvDpOT_+0x1dd2
> C [libkudu_client.so.0+0x4c872] _init+0xd022
> C [libkudu_client.so.0+0x4c9d6] _init+0xd186
> C [libkudu_client.so.0+0x70775] _ZNSt6vectorIPN4kudu6client9KuduErrorESaIS3_EE19_M_emplace_back_auxIIS3_EEEvDpOT_+0x2ee5
> C [libkudu_client.so.0+0x754d9] _ZNSt6vectorIPN4kudu6client9KuduErrorESaIS3_EE19_M_emplace_back_auxIIS3_EEEvDpOT_+0x7c49
> C [libkudu_client.so.0+0xc9be0] kudu::client::KuduUpsert::~KuduUpsert()+0x2ae30
> C [libkudu_client.so.0+0xc9eb2] kudu::client::KuduUpsert::~KuduUpsert()+0x2b102
> C [libkudu_client.so.0+0xcc73d] _ZNSt6vectorIN4kudu5SliceESaIS1_EE19_M_emplace_back_auxIIS1_EEEvDpOT_+0x123d
> C [libkudu_client.so.0+0xbe405] kudu::client::KuduUpsert::~KuduUpsert()+0x1f655
> C [libkudu_client.so.0+0xcdc0c] _ZNSt6vectorIN4kudu5SliceESaIS1_EE19_M_emplace_back_auxIIS1_EEEvDpOT_+0x270c
> C [libkudu_client.so.0+0x25ec1b] _ZNSt8_Rb_treeISsSt4pairIKSsSsESt10_Select1stIS2_ESt4lessISsESaIS2_EE22_M_emplace_hint_uniqueIJRKSt21piecewise_construct_tSt5tupleIJRS1_EESD_IJEEEEESt17_Rb_tree_iteratorIS2_ESt23_Rb_tree_const_iteratorIS2_EDpOT_+0x311b
> C [libkudu_client.so.0+0x262324] _ZNSt8_Rb_treeISsSt4pairIKSsSsESt10_Select1stIS2_ESt4lessISsESaIS2_EE22_M_emplace_hint_uniqueIJRKSt21piecewise_construct_tSt5tupleIJRS1_EESD_IJEEEEESt17_Rb_tree_iteratorIS2_ESt23_Rb_tree_const_iteratorIS2_EDpOT_+0x6824
> T_+0x558a
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)