[ 
https://issues.apache.org/jira/browse/IMPALA-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669410#comment-16669410
 ] 

Michael Brown commented on IMPALA-7736:
---------------------------------------

I finally got time to come back to this, and I have a way to reproduce this 
with some instrumented logging to see what's going on. Kind of a pain since I 
have to wait for some time, but anyway....

The bug is under counting of {{self._num_queries_started_running_or_cancelled}}.

{noformat}
(Pdb) p self._num_queries_started_running_or_cancelled.value
213
(Pdb) p self._num_queries_finished.value
218
(Pdb)
{noformat}

In my logfile, I found 5 queries that failed to increment 
{{self._num_queries_started_running_or_cancelled}} and they all failed like 
this:

{noformat}
ExecQueryFInstances rpc query_id=f4c264ac41953b9:f289dfd400000000 failed: 
Failed to get minimum memory reservation of 245.25 MB on daemon redacted:22000 
for query f4c264ac41953b9:f289dfd400000000 due to following error: Failed to 
increase reservation by 245.25 MB because it would exceed the applicable 
reservation limit for the "Process" ReservationTracker: reservation_limit=42.67 
GB reservation=42.64 GB used_reservation=0 child_reservations=42.64 GB
{noformat}

I instrumented concurrent_select.py to log when it's incrementing counters and 
did some post-processing on the logs to make it readable. A full query where 
this is going on:
{noformat}
2018-10-30 10:46:14,269 1427 Query Consumer 
DEBUG:concurrent_select[725]:Getting query_idx
2018-10-30 10:46:14,269 1427 Query Consumer 
DEBUG:concurrent_select[742]:Waiting for other query runners to start their 
queries
2018-10-30 11:00:54,832 1427 Query Consumer 
DEBUG:concurrent_select[748]:Requesting memory reservation
2018-10-30 11:01:04,350 1427 Query Consumer 
DEBUG:concurrent_select[382]:Reserved 636 MB; 53 MB available; 64198 MB 
overcommitted
2018-10-30 11:01:04,350 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for 
c_ulong(186L)> with id 140453288913680
2018-10-30 11:01:04,351 1427 Query Consumer 
DEBUG:concurrent_select[750]:Received memory reservation
2018-10-30 11:01:04,351 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(186)> 
with id self._num_queries_submitted  <--------------------------------- new 
DEBUG statement with post-processed logging for counters I care about
2018-10-30 11:01:04,363 1427 Query Consumer DEBUG:concurrent_select[1126]:Using 
tpcds_300_decimal_parquet database
2018-10-30 11:01:04,363 1427 Query Consumer DEBUG:db_connection[203]:IMPALA: 
USE tpcds_300_decimal_parquet
2018-10-30 11:01:04,394 1427 Query Consumer DEBUG:db_connection[203]:IMPALA: 
SET ABORT_ON_ERROR=1
2018-10-30 11:01:04,440 1427 Query Consumer 
DEBUG:concurrent_select[1143]:Setting mem limit to 636 MB
2018-10-30 11:01:04,440 1427 Query Consumer DEBUG:db_connection[203]:IMPALA: 
SET MEM_LIMIT=636M
2018-10-30 11:01:04,477 1427 Query Consumer 
DEBUG:concurrent_select[1147]:Running query with 636 MB mem limit at redacted5 
with timeout secs 45943.8717842:
-- RESULT MISMATCH FROM ORIGINAL
-- ADD ROUND()s TO 4th, 5th, 9th, 10th COLUMNS, TAKE ACTUAL RESULTS AS EXPECTED.
with inv as
(select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
       ,stdev,mean, case mean when 0 then null else stdev/mean end cov
 from(select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
            ,stddev_samp(inv_quantity_on_hand) stdev,avg(inv_quantity_on_hand) 
mean
      from inventory
          ,item
          ,warehouse
          ,date_dim
      where inv_item_sk = i_item_sk
        and inv_warehouse_sk = w_warehouse_sk
        and inv_date_sk = d_date_sk
        and d_year =2001
      group by w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy) foo
 where case mean when 0 then 0 else stdev/mean end > 1)
select inv1.w_warehouse_sk,inv1.i_item_sk,inv1.d_moy
        ,round(inv1.mean, 2), round(inv1.cov, 8)
        ,inv2.w_warehouse_sk,inv2.i_item_sk,inv2.d_moy
        ,round(inv2.mean, 2), round(inv2.cov, 8)
from inv inv1,inv inv2
where inv1.i_item_sk = inv2.i_item_sk
  and inv1.w_warehouse_sk =  inv2.w_warehouse_sk
  and inv1.d_moy=1
  and inv2.d_moy=1+1
order by inv1.w_warehouse_sk,inv1.i_item_sk,inv1.d_moy,inv1.mean,inv1.cov
        ,inv2.d_moy,inv2.mean, inv2.cov
;
2018-10-30 11:01:04,803 1427 Query Consumer DEBUG:concurrent_select[1106]:Error 
running query with id None: ExecQueryFInstances rpc 
query_id=f4c264ac41953b9:f289dfd400000000 failed: Failed to get minimum memory 
reservation of 245.25 MB on daemon redacted2:22000 for query 
f4c264ac41953b9:f289dfd400000000 due to following error: Failed to increase 
reservation by 245.25 MB because it would exceed the applicable reservation 
limit for the "Process" ReservationTracker: reservation_limit=42.67 GB 
reservation=42.64 GB used_reservation=0 child_reservations=42.64 GB
2018-10-30 11:01:04,805 1427 Query Consumer DEBUG:concurrent_select[760]:Got 
execution report for query
2018-10-30 11:01:04,805 1427 Query Consumer 
DEBUG:concurrent_select[868]:Updating runtime stats
2018-10-30 11:01:04,805 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(142)> 
with id self._num_queries_finished
2018-10-30 11:01:04,805 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(11)> 
with id 140453288911056
2018-10-30 11:01:04,806 1427 Query Consumer 
DEBUG:concurrent_select[395]:Released 636 MB; 678 MB available; 63573 MB 
overcommitted
{noformat}

We don't see an increment for 
{{self._num_queries_started_running_or_cancelled}}.

Take instead this other query, same thread even, that shows the increment:

{noformat}
2018-10-30 10:38:40,989 1427 Query Consumer DEBUG:concurrent_select[708]:New 
query runner started
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[725]:Getting query_idx
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[742]:Waiting for other query runners to start their 
queries
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[748]:Requesting memory reservation
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[382]:Reserved 279 MB; 51160 MB available; 13091 MB 
overcommitted
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for 
c_ulong(77L)> with id 140453288913680
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[750]:Received memory reservation
2018-10-30 10:38:41,268 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(77)> 
with id self._num_queries_submitted
2018-10-30 10:38:41,377 1427 Query Consumer DEBUG:concurrent_select[1126]:Using 
tpcds_300_decimal_parquet database
2018-10-30 10:38:41,377 1427 Query Consumer DEBUG:db_connection[203]:IMPALA: 
USE tpcds_300_decimal_parquet
2018-10-30 10:38:42,718 1427 Query Consumer DEBUG:db_connection[203]:IMPALA: 
SET ABORT_ON_ERROR=1
2018-10-30 10:38:43,468 1427 Query Consumer 
DEBUG:concurrent_select[1143]:Setting mem limit to 279 MB
2018-10-30 10:38:43,468 1427 Query Consumer DEBUG:db_connection[203]:IMPALA: 
SET MEM_LIMIT=279M
2018-10-30 10:38:44,324 1427 Query Consumer 
DEBUG:concurrent_select[1147]:Running query with 279 MB mem limit at redacted5 
with timeout secs 8091.05727196:
2018-10-30 10:38:51,970 1427 Query Consumer DEBUG:concurrent_select[1086]:Query 
id is 514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:38:51,971 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(86)> 
with id self._num_queries_started_running_or_cancelled 
<------------------------------------------------ this is correctly implemented
2018-10-30 10:38:57,259 1427 Query Consumer 
DEBUG:concurrent_select[1182]:Waiting for query to execute
2018-10-30 10:46:12,321 1427 Fetch Results 514ecaffd5cce5e1:e693d96200000000 
DEBUG:concurrent_select[1260]:Fetching result for query with id 
514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:46:12,620 1427 Fetch Results 514ecaffd5cce5e1:e693d96200000000 
DEBUG:concurrent_select[1260]:Fetching result for query with id 
514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:46:12,622 1427 Fetch Results 514ecaffd5cce5e1:e693d96200000000 
DEBUG:concurrent_select[1266]:No more results for query with id 
514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:46:12,641 1427 Query Consumer INFO:hiveserver2[265]:Closing 
active operation
2018-10-30 10:46:14,268 1427 Query Consumer DEBUG:concurrent_select[760]:Got 
execution report for query
2018-10-30 10:46:14,268 1427 Query Consumer 
DEBUG:concurrent_select[868]:Updating runtime stats
2018-10-30 10:46:14,269 1427 Query Consumer 
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(63)> 
with id self._num_queries_finished
2018-10-30 10:46:14,269 1427 Query Consumer 
DEBUG:concurrent_select[395]:Released 279 MB; 1431 MB available; 62820 MB 
overcommitted
{noformat}

> stress: AssertionError: The number of executing queries is negative
> -------------------------------------------------------------------
>
>                 Key: IMPALA-7736
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7736
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.1.0
>            Reporter: Michael Brown
>            Priority: Critical
>              Labels: stress
>
> I'd heard about this anecdotally but haven't seen it until now:
> {noformat}
> Done | Active | Executing | Mem Lmt Ex | AC Reject | AC Timeout | Time Out | 
> Cancel | Err | Incorrect | Next Qry Mem Lmt | Tot Qry Mem Lmt | Tracked Mem | 
> RSS Mem
>  263 |      6 |         3 |         20 |         0 |          0 |        0 |  
>    17 |   0 |         0 |             9818 |           26802 |       15656 |  
>  19329
>  263 |      6 |         3 |         20 |         0 |          0 |        0 |  
>    17 |   0 |         0 |             9818 |           26802 |       15534 |  
>  19341
>  263 |      6 |         3 |         20 |         0 |          0 |        0 |  
>    17 |   0 |         0 |             9818 |           26802 |       15462 |  
>  19190
>  265 |      4 |         1 |         20 |         0 |          0 |        0 |  
>    17 |   0 |         0 |             9818 |           22938 |       15482 |  
>  19565
> Traceback (most recent call last):
>   File "tests/stress/concurrent_select.py", line 2320, in <module>
>     main()
>   File "tests/stress/concurrent_select.py", line 2316, in main
>     select_probability=args.select_probability)
>   File "tests/stress/concurrent_select.py", line 529, in run_queries
>     self._wait_for_test_to_finish(impala, should_print_status)
>   File "tests/stress/concurrent_select.py", line 943, in 
> _wait_for_test_to_finish
>     self._print_status(print_header=(lines_printed == 0))
>   File "tests/stress/concurrent_select.py", line 837, in _print_status
>     self._num_queries_executing,
>   File "tests/stress/concurrent_select.py", line 690, in 
> _num_queries_executing
>     assert num_executing >= 0, "The number of executing queries is negative"
> AssertionError: The number of executing queries is negative
> {noformat}
> I've run many thousand-query stress test runs against 3.1 without hitting 
> this. Where this finally happened was against a 2.12-based build, if that 
> matters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to