[
https://issues.apache.org/jira/browse/IMPALA-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749308#comment-16749308
]
Michael Ho edited comment on IMPALA-8027 at 2/27/19 4:20 AM:
-------------------------------------------------------------
The profile didn't suggest much either. The last report time for the
coordinator fragment was 05:16:55, which was when the profile was initially
created. So, the coordinator didn't receive any update from its own
coordinator fragment for about two minutes, which suggests Prepare() didn't
complete within that window.
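To illustrate the reasoning above: the coordinator can treat a fragment instance as unresponsive when its last status report is older than some threshold. This is only a hypothetical sketch (not Impala's actual implementation; the class, function, and 120s threshold are made up for illustration), mirroring the ~2 minute gap seen in the profile:

```python
# Hypothetical sketch of a coordinator-side last-report staleness check.
# Not Impala's actual code; names and the 120s threshold are assumptions.
REPORT_TIMEOUT_S = 120.0  # assumed report-timeout threshold


class InstanceStatus:
    def __init__(self, instance_id, last_report_s):
        self.instance_id = instance_id
        self.last_report_s = last_report_s  # wall-clock seconds of last report


def overdue_instances(instances, now_s, timeout_s=REPORT_TIMEOUT_S):
    """Return ids of instances whose last status report is older than timeout_s."""
    return [i.instance_id for i in instances
            if now_s - i.last_report_s > timeout_s]


# Mirrors the profile above: the coordinator fragment's last report is ~124s
# old (05:16:52 vs 05:18:56), while a peer instance reported just now.
now = 0.0
stale = InstanceStatus("194f5b70907ac97c:84a116d600000000", now - 124.0)
fresh = InstanceStatus("194f5b70907ac97c:84a116d600000003", now - 0.01)
print(overdue_instances([stale, fresh], now))
# -> ['194f5b70907ac97c:84a116d600000000']
```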
{noformat}
Instance 194f5b70907ac97c:84a116d600000000
(host=impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22000)
Last report received time: 2018-12-28 05:16:52.214
{noformat}
FWIW, fragment instances running on other Impalad peers were able to send
reports, so it seems status report handling was working on the coordinator
({{impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22000}}):
{noformat}
Instance 194f5b70907ac97c:84a116d600000003
(host=impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22001)
Last report received time: 2018-12-28 05:18:56.204
Hdfs split stats (<volume id>:<# splits>/<split lengths>): -1:4/152.76 KB
{noformat}
The fragment instances running on
{{impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22000}}
eventually completed.
{noformat}
I1228 05:16:52.217275 26685 query-state.cc:568] Executing instance.
instance_id=194f5b70907ac97c:84a116d600000000 fragment_idx=0
per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=7
I1228 05:20:08.981983 26685 krpc-data-stream-mgr.cc:294] DeregisterRecvr():
fragment_instance_id=194f5b70907ac97c:84a116d600000000, node=2
I1228 05:20:08.982043 26685 query-state.cc:576] Instance completed.
instance_id=194f5b70907ac97c:84a116d600000000 #in-flight=5 status=CANCELLED:
Cancelled
{noformat}
{noformat}
I1228 05:16:52.217344 26686 query-state.cc:568] Executing instance.
instance_id=194f5b70907ac97c:84a116d600000001 fragment_idx=1
per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=8
I1228 05:20:08.992434 26686 llvm-codegen.cc:1104] For query
194f5b70907ac97c:84a116d600000000 the following functions were not finalized
and have been removed from the module:
I1228 05:20:09.049794 26686 query-state.cc:576] Instance completed.
instance_id=194f5b70907ac97c:84a116d600000001 #in-flight=5 status=CANCELLED:
Cancelled
I1228 05:20:09.056677 26686 query-exec-mgr.cc:182] ReleaseQueryState(): deleted
query_id=194f5b70907ac97c:84a116d600000000
{noformat}
With the data available so far, one can only conclude that the test failed
due to some flaky scheduling issue which slowed down query fragments running
on {{impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22000}}.
Given this hasn't happened again since the initial report, I don't think there
is much we need to do at this point.
Longer term, fixing IMPALA-6818 will remove this class of flaky timeout issues.
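For context on the timeout in the quoted logs below: a KRPC data-stream sender's TransmitData call effectively parks until the receiving fragment instance registers its receiver, or fails with a sender-side timeout if registration never happens (e.g. because Prepare() is stuck). The following is a hypothetical sketch of that rendezvous, not Impala's actual code; the class, method names, and timeout value are made up for illustration:

```python
import threading

# Hypothetical sketch of the sender/receiver rendezvous with a sender-side
# timeout. Not Impala's actual implementation; names are assumptions.
class DataStreamMgr:
    def __init__(self):
        self._cv = threading.Condition()
        self._receivers = set()  # registered (finstance_id, node_id) pairs

    def register_recvr(self, finstance_id, node_id):
        """Called when the receiving fragment instance finishes Prepare()."""
        with self._cv:
            self._receivers.add((finstance_id, node_id))
            self._cv.notify_all()

    def transmit(self, finstance_id, node_id, timeout_s):
        """Park until the receiver exists; return a timeout status otherwise."""
        with self._cv:
            if self._cv.wait_for(
                    lambda: (finstance_id, node_id) in self._receivers,
                    timeout=timeout_s):
                return "OK"
            return ("DATASTREAM_SENDER_TIMEOUT: Sender timed out waiting for "
                    f"receiver fragment instance: {finstance_id}, "
                    f"dest node: {node_id}")


mgr = DataStreamMgr()
# Receiver never registers (e.g. Prepare() is stuck), so the sender times out:
print(mgr.transmit("194f5b70907ac97c:84a116d600000000", 2, timeout_s=0.05))
# Once the receiver registers, the same send succeeds:
mgr.register_recvr("194f5b70907ac97c:84a116d600000000", 2)
print(mgr.transmit("194f5b70907ac97c:84a116d600000000", 2, timeout_s=0.05))
# prints "OK"
```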
> KRPC datastream timing out on both the receiver and sender side even in a
> minicluster
> -------------------------------------------------------------------------------------
>
> Key: IMPALA-8027
> URL: https://issues.apache.org/jira/browse/IMPALA-8027
> Project: IMPALA
> Issue Type: Bug
> Components: Distributed Exec
> Affects Versions: Impala 3.2.0
> Reporter: Bikramjeet Vig
> Assignee: Michael Ho
> Priority: Critical
> Labels: broken-build
>
> KRPC datastreams seem to time out at the same time on both the sender and
> receiver side, causing two running queries to fail. This happened while
> running core tests on S3.
> Logs from coordinator:
> {noformat}
> I1228 05:18:56.202587 13396 krpc-data-stream-mgr.cc:353] Sender 127.0.0.1
> timed out waiting for receiver fragment instance:
> 8f46b2518734bef1:6ef2d40400000000, dest node: 2
> I1228 05:18:56.203061 13396 rpcz_store.cc:265] Call
> impala.DataStreamService.TransmitData from 127.0.0.1:53118 (request call id
> 11274) took 120782ms. Request Metrics: {}
> I1228 05:18:56.203114 13396 krpc-data-stream-mgr.cc:353] Sender 127.0.0.1
> timed out waiting for receiver fragment instance:
> 194f5b70907ac97c:84a116d600000000, dest node: 2
> I1228 05:18:56.203136 13396 rpcz_store.cc:265] Call
> impala.DataStreamService.TransmitData from 127.0.0.1:53110 (request call id
> 8637) took 123811ms. Request Metrics: {}
> I1228 05:18:56.203155 13396 krpc-data-stream-mgr.cc:353] Sender 127.0.0.1
> timed out waiting for receiver fragment instance:
> 194f5b70907ac97c:84a116d600000000, dest node: 2
> I1228 05:18:56.203167 13396 rpcz_store.cc:265] Call
> impala.DataStreamService.TransmitData from 127.0.0.1:53118 (request call id
> 11273) took 123776ms. Request Metrics: {}
> I1228 05:18:56.203181 13396 krpc-data-stream-mgr.cc:408] Reduced stream ID
> cache from 413 items, to 410, eviction took: 1ms
> I1228 05:18:56.204746 13377 coordinator.cc:707] Backend completed:
> host=impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22001
> remaining=2 query_id=8f46b2518734bef1:6ef2d40400000000
> I1228 05:18:56.204756 13377 coordinator-backend-state.cc:262]
> query_id=8f46b2518734bef1:6ef2d40400000000: first in-progress backend:
> impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22000
> I1228 05:18:56.204769 13377 coordinator.cc:522] ExecState: query
> id=8f46b2518734bef1:6ef2d40400000000
> finstance=8f46b2518734bef1:6ef2d40400000001 on
> host=impala-ec2-centos74-m5-4xlarge-ondemand-07b3.vpc.cloudera.com:22001
> (EXECUTING -> ERROR) status=Sender 127.0.0.1 timed out waiting for receiver
> fragment instance: 8f46b2518734bef1:6ef2d40400000000, dest node: 2
> {noformat}
> Logs from executor:
> {noformat}
> E1228 05:18:56.203181 26715 krpc-data-stream-sender.cc:343] channel send to
> 127.0.0.1:27000 failed:
> (fragment_instance_id=8f46b2518734bef1:6ef2d40400000000): Sender 127.0.0.1
> timed out waiting for receiver fragment instance:
> 8f46b2518734bef1:6ef2d40400000000, dest node: 2
> E1228 05:18:56.203256 26682 krpc-data-stream-sender.cc:343] channel send to
> 127.0.0.1:27000 failed:
> (fragment_instance_id=194f5b70907ac97c:84a116d600000000): Sender 127.0.0.1
> timed out waiting for receiver fragment instance:
> 194f5b70907ac97c:84a116d600000000, dest node: 2
> I1228 05:18:56.203451 26715 query-state.cc:576] Instance completed.
> instance_id=8f46b2518734bef1:6ef2d40400000001 #in-flight=3
> status=DATASTREAM_SENDER_TIMEOUT: Sender 127.0.0.1 timed out waiting for
> receiver fragment instance: 8f46b2518734bef1:6ef2d40400000000, dest node: 2
> I1228 05:18:56.203485 26713 query-state.cc:249] UpdateBackendExecState():
> last report for 8f46b2518734bef1:6ef2d40400000000
> I1228 05:18:56.203514 26682 query-state.cc:576] Instance completed.
> instance_id=194f5b70907ac97c:84a116d600000003 #in-flight=2
> status=DATASTREAM_SENDER_TIMEOUT: Sender 127.0.0.1 timed out waiting for
> receiver fragment instance: 194f5b70907ac97c:84a116d600000000, dest node: 2
> I1228 05:18:56.203536 26680 query-state.cc:249] UpdateBackendExecState():
> last report for 194f5b70907ac97c:84a116d600000000
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)