[
https://issues.apache.org/jira/browse/IMPALA-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403439#comment-17403439
]
Bikramjeet Vig commented on IMPALA-10877:
-----------------------------------------
Looked at the logs. It seems a scan fragment was stuck, probably due to the
sleep in the query, and was holding a minimal amount of memory on one of the
executors. This prevented the query from being admitted, as the mem_limit was
exactly equal to the process mem_limit. *To fix this flakiness*, we can reduce
the mem_limit to something slightly smaller, so that it still lets only a
single query run at a time but leaves enough slack for a stuck fragment to
finish up before getting cancelled.
Attempt to admit query:
{noformat}
f446176d7cdc1a47:960d363400000000] Stats: agg_num_running=0, agg_num_queued=0,
agg_mem_reserved=4.94 MB, local_host(local_mem_admitted=0,
num_admitted_running=0, num_queued=0, backend_mem_reserved=4.00 MB,
topN_query_stats: queries=[c54c69cff22d2536:aae16d2c00000000],
total_mem_consumed=4.00 MB, fraction_of_pool_total_mem=1; pool_level_stats:
num_running=1, min=4.00 MB, max=4.00 MB, pool_total_mem=4.00 MB,
average_per_query=4.00 MB)
{noformat}
c54c69cff22d2536:aae16d2c00000000 is the stuck query, holding only 4 MB of
memory. Hence we get the following queued reason:
{noformat}
Could not dequeue query id=f446176d7cdc1a47:960d363400000000 reason: Not enough
memory available on host
impala-ec2-centos74-m5-4xlarge-ondemand-1daa.vpc.cloudera.com:27003. Needed
4.00 GB but only 4.00 GB out of 4.00 GB was available.
{noformat}
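The arithmetic behind that message can be sketched as below. This is a
simplified model of a per-host memory check, not Impala's actual admission
controller code; the helper names (fmt_gb, can_admit) and the 64 MB of slack
are assumptions chosen only for illustration. It suggests why the message
reads "Needed 4.00 GB but only 4.00 GB ... was available": with 4 MB held by
the stuck fragment, the available amount is slightly under 4 GB but would
still print as 4.00 GB at two decimals.

```python
# Simplified model of a per-host memory check; NOT Impala's actual
# admission-control code. All helper names here are hypothetical.
MB = 1024 ** 2
GB = 1024 ** 3


def fmt_gb(num_bytes):
    """Format a byte count in GB with two decimals, like the log message."""
    return "%.2f GB" % (num_bytes / float(GB))


def can_admit(needed, host_mem_limit, mem_reserved):
    """Return (admitted, available) for a single host."""
    available = host_mem_limit - mem_reserved
    return needed <= available, available


process_mem_limit = 4 * GB  # from the log: "out of 4.00 GB"
stuck_reserved = 4 * MB     # from the log: backend_mem_reserved=4.00 MB

# Current test setup: mem_limit is exactly the process mem_limit.
admitted, available = can_admit(4 * GB, process_mem_limit, stuck_reserved)
print(admitted, fmt_gb(4 * GB), fmt_gb(available))

# Proposed fix, with an assumed slack of 64 MB (illustrative value only).
reduced_mem_limit = 4 * GB - 64 * MB
admitted_reduced, _ = can_admit(
    reduced_mem_limit, process_mem_limit, stuck_reserved)

# A second query with the same mem_limit still must not fit alongside the
# first, so the test keeps its "one query at a time" property.
second_fits, _ = can_admit(
    reduced_mem_limit, process_mem_limit, reduced_mem_limit)
print(admitted_reduced, second_fits)
```

In this model the first check fails even though both amounts format as
4.00 GB, while the reduced mem_limit is admitted next to the 4 MB straggler
and still leaves no room for a second concurrent query.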
> test_admission_control_with_multiple_coords fails due to an assert
> ------------------------------------------------------------------
>
> Key: IMPALA-10877
> URL: https://issues.apache.org/jira/browse/IMPALA-10877
> Project: IMPALA
> Issue Type: Bug
> Reporter: Abhishek Rawat
> Assignee: Bikramjeet Vig
> Priority: Major
>
> The testcase fails due to the following assert:
> {code:java}
> custom_cluster/test_executor_groups.py:579: in
> test_admission_control_with_multiple_coords
> "admission-controller.agg-num-running.default-pool", 1, timeout=30)
> common/impala_service.py:143: in wait_for_metric_value
> self.__metric_timeout_assert(metric_name, expected_value, timeout)
> common/impala_service.py:210: in __metric_timeout_assert
> assert 0, assert_string
> E AssertionError: Metric admission-controller.agg-num-running.default-pool
> did not reach value 1 in 30s.
> E Dumping debug webpages in JSON format...
> E Dumped memz JSON to
> $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/memz.json
> E Dumped metrics JSON to
> $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/metrics.json
> E Dumped queries JSON to
> $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/queries.json
> E Dumped sessions JSON to
> $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/sessions.json
> E Dumped threadz JSON to
> $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/threadz.json
> E Dumped rpcz JSON to
> $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/rpcz.json
> E Dumping minidumps for impalads/catalogds...
> E Dumped minidump for Impalad PID 8103
> E Dumped minidump for Impalad PID 8106
> E Dumped minidump for Impalad PID 10328
> E Dumped minidump for Impalad PID 10331
> E Dumped minidump for Catalogd PID 8041
> {code}
>