[ https://issues.apache.org/jira/browse/KUDU-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15666133#comment-15666133 ]

Todd Lipcon commented on KUDU-1745:
-----------------------------------

I actually just figured out the cluster based on the IP in the error message :)

Here's one thing I find interesting in some of the impalad logs:

{code}
W1114 18:38:23.978837  9768 kernel_stack_watchdog.cc:144] Thread 24970 stuck at 
/data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-centos-6/toolchain/source/kudu/kudu-88b02349d88d335caac18bf8b930eac6d327ed40/src/kudu/rpc/outbound_call.cc:185
 for 101086ms:
Kernel stack:
[<ffffffff810b23ba>] futex_wait_queue_me+0xba/0xf0
[<ffffffff810b3510>] futex_wait+0x1c0/0x310
[<ffffffff810b4e21>] do_futex+0x121/0xae0
[<ffffffff810b585b>] sys_futex+0x7b/0x170
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
{code}

This message is emitted when an RPC callback takes a long time - here the 
thread appears to have been blocked for about 101 seconds. I also see other 
messages preceding it, like these:

{code}
W1114 18:38:15.426391 25528 outbound_call.cc:199] RPC callback for RPC call 
kudu.tserver.TabletServerService.Write -> {remote=172.28.195.84:7050, 
user_credentials={real_user=impala}} blocked reactor thread for 1.09099e+08us
W1114 18:38:15.426630 25528 connection.cc:205] RPC call timeout handler was 
delayed by 67.0671s! This may be due to a process-wide pause such as swapping, 
logging-related delays, or allocator lock contention. Will allow an additional 
17.9996s for a response.
{code}

so something's blocking our threads for a really long time. In the context of 
Impala it's a little tricky to figure out what, because we need SIGUSR2 in 
order to fill in the user-space stack. Would it be possible to try this 
workload again after adding a call to kudu::SetStackTraceSignal(...) with some 
signal that you don't use in Impala? e.g. maybe SIGRTMIN+1 or some such. I'm 
not 100% sure it's safe to use unless you compile your whole toolchain with 
-fno-omit-frame-pointer, but it should be safe enough to turn on for now to 
help diagnose what's going on here.
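
Something like this very early in impalad startup would do it (just a sketch; 
I'm assuming the declaration lives in kudu/util/debug-util.h and takes the 
signal number to install, so double-check the header and signature against the 
Kudu tree you actually build against):

{code}
// Sketch only: route Kudu's stack-collection signal away from SIGUSR2 so it
// doesn't collide with the signals Impala already uses. The header path and
// exact signature below are assumptions; verify them against the Kudu source.
#include <csignal>

#include "kudu/util/debug-util.h"

void InitKuduDiagnostics() {
  // SIGRTMIN+1 is just an example; any signal Impala doesn't otherwise use
  // should work.
  kudu::SetStackTraceSignal(SIGRTMIN + 1);
}
{code}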

My main theory is that something is blocking the log disk for minutes on end, 
and that's causing a bunch of timeouts because we're trying to log stuff from 
the write callback.
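
To illustrate the failure mode (a toy sketch, not the actual Kudu reactor code; 
the names here are made up): a reactor is a single thread draining a queue of 
callbacks, so one callback stuck in a blocking log write delays every callback 
queued behind it, including timeout handlers:

{code}
#include <chrono>
#include <functional>
#include <iostream>
#include <queue>
#include <thread>

int main() {
  using Clock = std::chrono::steady_clock;
  std::queue<std::function<void()>> reactor_queue;
  const auto start = Clock::now();
  auto elapsed_ms = [&] {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now() - start).count();
  };

  // Callback 1: an RPC completion callback that logs synchronously; the sleep
  // stands in for a write stalled on a slow or blocked log disk.
  reactor_queue.push([&] {
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << "write callback finished at " << elapsed_ms() << "ms\n";
  });

  // Callback 2: a timeout handler that should have fired almost immediately.
  reactor_queue.push([&] {
    std::cout << "timeout handler ran at " << elapsed_ms()
              << "ms (delayed by the callback ahead of it)\n";
  });

  // The single reactor thread drains callbacks one at a time, so the delay
  // from the first callback shows up in the second.
  while (!reactor_queue.empty()) {
    reactor_queue.front()();
    reactor_queue.pop();
  }
  return 0;
}
{code}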

All of the above notwithstanding, we shouldn't crash :) Obviously there's a bug 
here as well, but it would be good to understand what is grinding our reactor 
threads to a halt in this stress scenario.


> Kudu causes Impala to crash under stress
> ----------------------------------------
>
>                 Key: KUDU-1745
>                 URL: https://issues.apache.org/jira/browse/KUDU-1745
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Taras Bobrovytsky
>            Priority: Critical
>         Attachments: hs_err_pid7761.log, hs_err_pid9275.log, stacks.out
>
>
> There were over 200 queries running, about half of which were selects and the 
> rest were upsert and delete queries.
> There was a crash after a few minutes with the following stack trace:
> {code}
> Stack: [0x00007f1629c93000,0x00007f162a694000],  sp=0x00007f162a6922b0,  free 
> space=10236k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C  [libstdc++.so.6+0xc5018]  std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&)+0x8
> C  [libkudu_client.so.0+0x6f662]  
> _ZNSt6vectorIPN4kudu6client9KuduErrorESaIS3_EE19_M_emplace_back_auxIIS3_EEEvDpOT_+0x1dd2
> C  [libkudu_client.so.0+0x4c872]  _init+0xd022
> C  [libkudu_client.so.0+0x4c9d6]  _init+0xd186
> C  [libkudu_client.so.0+0x70775]  
> _ZNSt6vectorIPN4kudu6client9KuduErrorESaIS3_EE19_M_emplace_back_auxIIS3_EEEvDpOT_+0x2ee5
> C  [libkudu_client.so.0+0x754d9]  
> _ZNSt6vectorIPN4kudu6client9KuduErrorESaIS3_EE19_M_emplace_back_auxIIS3_EEEvDpOT_+0x7c49
> C  [libkudu_client.so.0+0xc9be0]  
> kudu::client::KuduUpsert::~KuduUpsert()+0x2ae30
> C  [libkudu_client.so.0+0xc9eb2]  
> kudu::client::KuduUpsert::~KuduUpsert()+0x2b102
> C  [libkudu_client.so.0+0xcc73d]  
> _ZNSt6vectorIN4kudu5SliceESaIS1_EE19_M_emplace_back_auxIIS1_EEEvDpOT_+0x123d
> C  [libkudu_client.so.0+0xbe405]  
> kudu::client::KuduUpsert::~KuduUpsert()+0x1f655
> C  [libkudu_client.so.0+0xcdc0c]  
> _ZNSt6vectorIN4kudu5SliceESaIS1_EE19_M_emplace_back_auxIIS1_EEEvDpOT_+0x270c
> C  [libkudu_client.so.0+0x25ec1b]  
> _ZNSt8_Rb_treeISsSt4pairIKSsSsESt10_Select1stIS2_ESt4lessISsESaIS2_EE22_M_emplace_hint_uniqueIJRKSt21piecewise_construct_tSt5tupleIJRS1_EESD_IJEEEEESt17_Rb_tree_iteratorIS2_ESt23_Rb_tree_const_iteratorIS2_EDpOT_+0x311b
> C  [libkudu_client.so.0+0x262324]  
> _ZNSt8_Rb_treeISsSt4pairIKSsSsESt10_Select1stIS2_ESt4lessISsESaIS2_EE22_M_emplace_hint_uniqueIJRKSt21piecewise_construct_tSt5tupleIJRS1_EESD_IJEEEEESt17_Rb_tree_iteratorIS2_ESt23_Rb_tree_const_iteratorIS2_EDpOT_+0x6824
> T_+0x558a
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
