[ https://issues.apache.org/jira/browse/IMPALA-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403439#comment-17403439 ]

Bikramjeet Vig commented on IMPALA-10877:
-----------------------------------------

I looked at the logs. It seems that a scan fragment was stuck, probably because of
the sleep in the query, and was still holding a small amount of memory on one of
the executors. This prevented the next query from being admitted, because its
mem_limit was exactly equal to the process mem_limit. *To fix this flakiness*, we
can reduce the mem_limit to something slightly smaller. That way it still only
lets a single query run at a time, but leaves enough slack for a stuck fragment
to finish up before it gets cancelled.
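The condition the fix relies on can be sketched as follows. This is a minimal illustration, not Impala code. It assumes that admission on a host succeeds only when the query's mem_limit fits in the process mem_limit minus the memory already reserved or admitted there. The helper names and the 100 MB slack value are hypothetical and are not taken from the test.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def fits_on_host(process_mem_limit: int, reserved: int, query_mem_limit: int) -> bool:
    """Simplified per-host check: the query's mem_limit must fit in the unreserved memory."""
    return query_mem_limit <= process_mem_limit - reserved


def is_safe_mem_limit(process_mem_limit: int, query_mem_limit: int, leftover: int) -> bool:
    """Hypothetical check for a test mem_limit that keeps one query at a time.

    - A new query must still be admitted while a stuck fragment holds `leftover` bytes.
    - A second query with the same mem_limit must never fit next to a running one.
    """
    admits_despite_leftover = fits_on_host(process_mem_limit, leftover, query_mem_limit)
    admits_second_query = fits_on_host(process_mem_limit, query_mem_limit, query_mem_limit)
    return admits_despite_leftover and not admits_second_query


# Current setup: mem_limit == process mem_limit, so 4 MB of leftover blocks admission.
print(is_safe_mem_limit(4 * GB, 4 * GB, 4 * MB))             # False
# Slightly smaller mem_limit (100 MB of slack is an arbitrary example value).
print(is_safe_mem_limit(4 * GB, 4 * GB - 100 * MB, 4 * MB))  # True
```

A much smaller value, such as half the process limit, would fail the second condition and let two queries run concurrently. That is why the mem_limit should only be trimmed slightly.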

Attempt to admit query:
{noformat}
f446176d7cdc1a47:960d363400000000] Stats: agg_num_running=0, agg_num_queued=0, 
agg_mem_reserved=4.94 MB,  local_host(local_mem_admitted=0, 
num_admitted_running=0, num_queued=0, backend_mem_reserved=4.00 MB, 
topN_query_stats: queries=[c54c69cff22d2536:aae16d2c00000000], 
total_mem_consumed=4.00 MB, fraction_of_pool_total_mem=1; pool_level_stats: 
num_running=1, min=4.00 MB, max=4.00 MB, pool_total_mem=4.00 MB, 
average_per_query=4.00 MB)
{noformat}
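c54c69cff22d2536:aae16d2c00000000 is the stuck query, and it holds only 4 MB of
memory. That leaves 4 GB minus 4 MB free on the host. This still prints as
"4.00 GB" after rounding, but it is not enough for a query that needs the full
4 GB. Hence we get the following queued reason: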
{noformat}
Could not dequeue query id=f446176d7cdc1a47:960d363400000000 reason: Not enough 
memory available on host 
impala-ec2-centos74-m5-4xlarge-ondemand-1daa.vpc.cloudera.com:27003. Needed 
4.00 GB but only 4.00 GB out of 4.00 GB was available.
{noformat}
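The seemingly contradictory "Needed 4.00 GB but only 4.00 GB out of 4.00 GB" can be reproduced with a small sketch. This is a simplified model, not Impala's implementation. It assumes that available memory is the host limit minus the memory still reserved, and that byte counts are printed in GB with two decimal places. The function names are hypothetical.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def pretty_gb(num_bytes: int) -> str:
    """Format a byte count in GB with two decimals (illustrative formatting only)."""
    return f"{num_bytes / GB:.2f} GB"


def dequeue_reason(process_mem_limit: int, reserved: int, needed: int):
    """Return None if the query fits on the host, else a queue reason string.

    Simplified model: available = process_mem_limit - reserved.
    """
    available = process_mem_limit - reserved
    if needed <= available:
        return None
    return (f"Needed {pretty_gb(needed)} but only {pretty_gb(available)} "
            f"out of {pretty_gb(process_mem_limit)} was available.")


# The stuck query still holds 4 MB, so 4092 MB (about 3.996 GB) is available.
print(dequeue_reason(4 * GB, 4 * MB, 4 * GB))
# -> Needed 4.00 GB but only 4.00 GB out of 4.00 GB was available.
```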

> test_admission_control_with_multiple_coords fails due to an assert
> ------------------------------------------------------------------
>
>                 Key: IMPALA-10877
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10877
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Abhishek Rawat
>            Assignee: Bikramjeet Vig
>            Priority: Major
>
> The test case fails due to the following assert:
> {code:java}
> custom_cluster/test_executor_groups.py:579: in test_admission_control_with_multiple_coords
>     "admission-controller.agg-num-running.default-pool", 1, timeout=30)
> common/impala_service.py:143: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> common/impala_service.py:210: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric admission-controller.agg-num-running.default-pool did not reach value 1 in 30s.
> E   Dumping debug webpages in JSON format...
> E   Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/memz.json
> E   Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/metrics.json
> E   Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/queries.json
> E   Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/sessions.json
> E   Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/threadz.json
> E   Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20210819_21:39:45/json/rpcz.json
> E   Dumping minidumps for impalads/catalogds...
> E   Dumped minidump for Impalad PID 8103
> E   Dumped minidump for Impalad PID 8106
> E   Dumped minidump for Impalad PID 10328
> E   Dumped minidump for Impalad PID 10331
> E   Dumped minidump for Catalogd PID 8041
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
