[
https://issues.apache.org/jira/browse/IMPALA-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238456#comment-17238456
]
Bikramjeet Vig commented on IMPALA-10348:
-----------------------------------------
Update:
I compiled a timeline of events from coordinator, statestore, and the test logs
to get an idea of what was happening. Seems like the last executor started at
20:11:49 but was not recognized by the coordinator till 20:15:20 which caused
the query that queued at 20:11:45 to timeout at 20:12:45.
Not sure what was holding up the coordinator to recognize and update its
membership view. will have to dig in more
{noformat}
W1119 20:11:45.731312 20667 executor-group.cc:164]
e64b11989b018148:20eb7ce500000000] Executor group default-pool-group1 is
unhealthy: 1 out of 3 are available.
I1119 20:11:45.731348 20667 admission-controller.cc:1210]
e64b11989b018148:20eb7ce500000000] Queuing, query
id=e64b11989b018148:20eb7ce500000000 reason: Waiting for executors to start.
Only DDL queries and queries scheduled only on the coordinator (either
NUM_NODES set to 1 or when small query optimization is triggered) can currently
run.
I1119 20:11:49.496661 21016 statestore.cc:650] Subscriber
'impa...@impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com:27002'
registered (registration id: 054017c8c8b88935:a73adf2ac1968294)
W1119 20:11:49.542403 20040 executor-group.cc:164] Executor group
default-pool-group1 is unhealthy: 2 out of 3 are available.
20:11:54 MainThread: Found 4 impalad/1 statestored/1 catalogd process(es)
20:11:54 MainThread: Getting num_known_live_backends from
impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com:25000
20:11:54 MainThread: Waiting for num_known_live_backends=4. Current value: 3
I1119 20:12:45.731456 20667 admission-controller.cc:1272]
e64b11989b018148:20eb7ce500000000] Admission for query exceeded timeout 60000ms
in pool default-pool. Queued reason: Waiting for executors to start. Only DDL
queries and queries scheduled only on the coordinator (either NUM_NODES set to
1 or when small query optimization is triggered) can currently run. Additional
Details: Not Applicable
20:15:20 MainThread: Getting num_known_live_backends from
impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com:25000
20:15:20 MainThread: num_known_live_backends has reached value: 4
{noformat}
> test_sequential_startup_wait failed due to query admission exceeding timeout
> 60000ms
> ------------------------------------------------------------------------------------
>
> Key: IMPALA-10348
> URL: https://issues.apache.org/jira/browse/IMPALA-10348
> Project: IMPALA
> Issue Type: Bug
> Reporter: Fang-Yu Rao
> Assignee: Bikramjeet Vig
> Priority: Major
> Labels: broken-build, flaky
> Attachments:
> catalogd.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.ERROR.20201119-201129.19576,
>
> catalogd.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.INFO.20201119-201129.19576,
>
> catalogd.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.WARNING.20201119-201129.19576,
>
> impalad.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.ERROR.20201119-201129.19631,
>
> impalad.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.INFO.20201119-201129.19631,
>
> impalad.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.WARNING.20201119-201129.19631,
>
> statestored.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.ERROR.20201119-201129.19560,
>
> statestored.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.INFO.20201119-201129.19560,
>
> statestored.impala-ec2-centos74-m5-4xlarge-ondemand-0d33.vpc.cloudera.com.jenkins.log.WARNING.20201119-201129.19560
>
>
> We found in a recent s3 build that the test of
> {{test_sequential_startup_wait}} failed due to admission for query exceeding
> timeout 60000ms.
> The error message in the console output of the Jenkins job is the following.
> {noformat}
> Error Message:
> E Query aborted:Admission for query exceeded timeout 60000ms in pool
> default-pool. Queued reason: Waiting for executors to start. Only DDL queries
> and queries scheduled only on the coordinator (either NUM_NODES set to 1 or
> when small query optimization is triggered) can currently run. Additional
> Details: Not Applicable
> {noformat}
> Before the timeout was reached we saw the following entries in the
> corresponding log file produced by an impalad.
> {noformat}
> W1119 20:11:45.731312 20667 executor-group.cc:164]
> e64b11989b018148:20eb7ce500000000] Executor group default-pool-group1 is
> unhealthy: 1 out of 3 are available.
> W1119 20:11:45.731338 20667 admission-controller.cc:1558]
> e64b11989b018148:20eb7ce500000000] Waiting for executors to start. Only DDL
> queries and queries scheduled only on the coordinator (either NUM_NODES set
> to 1 or when small query optimization is triggered) can currently run.
> I1119 20:11:45.731348 20667 admission-controller.cc:1210]
> e64b11989b018148:20eb7ce500000000] Queuing, query
> id=e64b11989b018148:20eb7ce500000000 reason: Waiting for executors to start.
> Only DDL queries and queries scheduled only on the coordinator (either
> NUM_NODES set to 1 or when small query optimization is triggered) can
> currently run.
> I1119 20:11:45.773303 20040 admission-controller.cc:1876] Could not dequeue
> query id=e64b11989b018148:20eb7ce500000000 reason: Waiting for executors to
> start. Only DDL queries and queries scheduled only on the coordinator (either
> NUM_NODES set to 1 or when small query optimization is triggered) can
> currently run.
> {noformat}
> The corresponding log files are also provided.
> The test was recently revised in IMPALA-8830, maybe [~bikramjeet.vig] could
> provide some insight into it. Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]