[
https://issues.apache.org/jira/browse/IMPALA-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669410#comment-16669410
]
Michael Brown commented on IMPALA-7736:
---------------------------------------
I finally got time to come back to this, and I have a way to reproduce this
with some instrumented logging to see what's going on. Kind of a pain since I
have to wait for some time, but anyway....
The bug is under counting of {{self._num_queries_started_running_or_cancelled}}.
{noformat}
(Pdb) p self._num_queries_started_running_or_cancelled.value
213
(Pdb) p self._num_queries_finished.value
218
(Pdb)
{noformat}
In my logfile, I found 5 queries that failed to increment
{{self._num_queries_started_running_or_cancelled}} and they all failed like
this:
{noformat}
ExecQueryFInstances rpc query_id=f4c264ac41953b9:f289dfd400000000 failed:
Failed to get minimum memory reservation of 245.25 MB on daemon redacted:22000
for query f4c264ac41953b9:f289dfd400000000 due to following error: Failed to
increase reservation by 245.25 MB because it would exceed the applicable
reservation limit for the "Process" ReservationTracker: reservation_limit=42.67
GB reservation=42.64 GB used_reservation=0 child_reservations=42.64 GB
{noformat}
I instrumented concurrent_select.py to log when it's incrementing counters and
did some post-processing on the logs to make it readable. A full query where
this is going on:
{noformat}
2018-10-30 10:46:14,269 1427 Query Consumer
DEBUG:concurrent_select[725]:Getting query_idx
2018-10-30 10:46:14,269 1427 Query Consumer
DEBUG:concurrent_select[742]:Waiting for other query runners to start their
queries
2018-10-30 11:00:54,832 1427 Query Consumer
DEBUG:concurrent_select[748]:Requesting memory reservation
2018-10-30 11:01:04,350 1427 Query Consumer
DEBUG:concurrent_select[382]:Reserved 636 MB; 53 MB available; 64198 MB
overcommitted
2018-10-30 11:01:04,350 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for
c_ulong(186L)> with id 140453288913680
2018-10-30 11:01:04,351 1427 Query Consumer
DEBUG:concurrent_select[750]:Received memory reservation
2018-10-30 11:01:04,351 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(186)>
with id self._num_queries_submitted <--------------------------------- new
DEBUG statement with post-processed logging for counters I care about
2018-10-30 11:01:04,363 1427 Query Consumer DEBUG:concurrent_select[1126]:Using
tpcds_300_decimal_parquet database
2018-10-30 11:01:04,363 1427 Query Consumer DEBUG:db_connection[203]:IMPALA:
USE tpcds_300_decimal_parquet
2018-10-30 11:01:04,394 1427 Query Consumer DEBUG:db_connection[203]:IMPALA:
SET ABORT_ON_ERROR=1
2018-10-30 11:01:04,440 1427 Query Consumer
DEBUG:concurrent_select[1143]:Setting mem limit to 636 MB
2018-10-30 11:01:04,440 1427 Query Consumer DEBUG:db_connection[203]:IMPALA:
SET MEM_LIMIT=636M
2018-10-30 11:01:04,477 1427 Query Consumer
DEBUG:concurrent_select[1147]:Running query with 636 MB mem limit at redacted5
with timeout secs 45943.8717842:
-- RESULT MISMATCH FROM ORIGINAL
-- ADD ROUND()s TO 4th, 5th, 9th, 10th COLUMNS, TAKE ACTUAL RESULTS AS EXPECTED.
with inv as
(select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
,stdev,mean, case mean when 0 then null else stdev/mean end cov
from(select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
,stddev_samp(inv_quantity_on_hand) stdev,avg(inv_quantity_on_hand)
mean
from inventory
,item
,warehouse
,date_dim
where inv_item_sk = i_item_sk
and inv_warehouse_sk = w_warehouse_sk
and inv_date_sk = d_date_sk
and d_year =2001
group by w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy) foo
where case mean when 0 then 0 else stdev/mean end > 1)
select inv1.w_warehouse_sk,inv1.i_item_sk,inv1.d_moy
,round(inv1.mean, 2), round(inv1.cov, 8)
,inv2.w_warehouse_sk,inv2.i_item_sk,inv2.d_moy
,round(inv2.mean, 2), round(inv2.cov, 8)
from inv inv1,inv inv2
where inv1.i_item_sk = inv2.i_item_sk
and inv1.w_warehouse_sk = inv2.w_warehouse_sk
and inv1.d_moy=1
and inv2.d_moy=1+1
order by inv1.w_warehouse_sk,inv1.i_item_sk,inv1.d_moy,inv1.mean,inv1.cov
,inv2.d_moy,inv2.mean, inv2.cov
;
2018-10-30 11:01:04,803 1427 Query Consumer DEBUG:concurrent_select[1106]:Error
running query with id None: ExecQueryFInstances rpc
query_id=f4c264ac41953b9:f289dfd400000000 failed: Failed to get minimum memory
reservation of 245.25 MB on daemon redacted2:22000 for query
f4c264ac41953b9:f289dfd400000000 due to following error: Failed to increase
reservation by 245.25 MB because it would exceed the applicable reservation
limit for the "Process" ReservationTracker: reservation_limit=42.67 GB
reservation=42.64 GB used_reservation=0 child_reservations=42.64 GB
2018-10-30 11:01:04,805 1427 Query Consumer DEBUG:concurrent_select[760]:Got
execution report for query
2018-10-30 11:01:04,805 1427 Query Consumer
DEBUG:concurrent_select[868]:Updating runtime stats
2018-10-30 11:01:04,805 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(142)>
with id self._num_queries_finished
2018-10-30 11:01:04,805 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(11)>
with id 140453288911056
2018-10-30 11:01:04,806 1427 Query Consumer
DEBUG:concurrent_select[395]:Released 636 MB; 678 MB available; 63573 MB
overcommitted
{noformat}
We don't see an increment for
{{self._num_queries_started_running_or_cancelled}}.
Take instead this other query, same thread even, that shows the increment:
{noformat}
2018-10-30 10:38:40,989 1427 Query Consumer DEBUG:concurrent_select[708]:New
query runner started
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[725]:Getting query_idx
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[742]:Waiting for other query runners to start their
queries
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[748]:Requesting memory reservation
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[382]:Reserved 279 MB; 51160 MB available; 13091 MB
overcommitted
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for
c_ulong(77L)> with id 140453288913680
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[750]:Received memory reservation
2018-10-30 10:38:41,268 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(77)>
with id self._num_queries_submitted
2018-10-30 10:38:41,377 1427 Query Consumer DEBUG:concurrent_select[1126]:Using
tpcds_300_decimal_parquet database
2018-10-30 10:38:41,377 1427 Query Consumer DEBUG:db_connection[203]:IMPALA:
USE tpcds_300_decimal_parquet
2018-10-30 10:38:42,718 1427 Query Consumer DEBUG:db_connection[203]:IMPALA:
SET ABORT_ON_ERROR=1
2018-10-30 10:38:43,468 1427 Query Consumer
DEBUG:concurrent_select[1143]:Setting mem limit to 279 MB
2018-10-30 10:38:43,468 1427 Query Consumer DEBUG:db_connection[203]:IMPALA:
SET MEM_LIMIT=279M
2018-10-30 10:38:44,324 1427 Query Consumer
DEBUG:concurrent_select[1147]:Running query with 279 MB mem limit at redacted5
with timeout secs 8091.05727196:
2018-10-30 10:38:51,970 1427 Query Consumer DEBUG:concurrent_select[1086]:Query
id is 514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:38:51,971 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(86)>
with id self._num_queries_started_running_or_cancelled
<------------------------------------------------ this is correctly implemented
2018-10-30 10:38:57,259 1427 Query Consumer
DEBUG:concurrent_select[1182]:Waiting for query to execute
2018-10-30 10:46:12,321 1427 Fetch Results 514ecaffd5cce5e1:e693d96200000000
DEBUG:concurrent_select[1260]:Fetching result for query with id
514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:46:12,620 1427 Fetch Results 514ecaffd5cce5e1:e693d96200000000
DEBUG:concurrent_select[1260]:Fetching result for query with id
514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:46:12,622 1427 Fetch Results 514ecaffd5cce5e1:e693d96200000000
DEBUG:concurrent_select[1266]:No more results for query with id
514ecaffd5cce5e1:e693d96200000000
2018-10-30 10:46:12,641 1427 Query Consumer INFO:hiveserver2[265]:Closing
active operation
2018-10-30 10:46:14,268 1427 Query Consumer DEBUG:concurrent_select[760]:Got
execution report for query
2018-10-30 10:46:14,268 1427 Query Consumer
DEBUG:concurrent_select[868]:Updating runtime stats
2018-10-30 10:46:14,269 1427 Query Consumer
DEBUG:concurrent_select[188]:Incrementing <Synchronized wrapper for c_int(63)>
with id self._num_queries_finished
2018-10-30 10:46:14,269 1427 Query Consumer
DEBUG:concurrent_select[395]:Released 279 MB; 1431 MB available; 62820 MB
overcommitted
{noformat}
> stress: AssertionError: The number of executing queries is negative
> -------------------------------------------------------------------
>
> Key: IMPALA-7736
> URL: https://issues.apache.org/jira/browse/IMPALA-7736
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 3.1.0
> Reporter: Michael Brown
> Priority: Critical
> Labels: stress
>
> I'd heard about this anecdotally but haven't seen it until now:
> {noformat}
> Done | Active | Executing | Mem Lmt Ex | AC Reject | AC Timeout | Time Out |
> Cancel | Err | Incorrect | Next Qry Mem Lmt | Tot Qry Mem Lmt | Tracked Mem |
> RSS Mem
> 263 | 6 | 3 | 20 | 0 | 0 | 0 |
> 17 | 0 | 0 | 9818 | 26802 | 15656 |
> 19329
> 263 | 6 | 3 | 20 | 0 | 0 | 0 |
> 17 | 0 | 0 | 9818 | 26802 | 15534 |
> 19341
> 263 | 6 | 3 | 20 | 0 | 0 | 0 |
> 17 | 0 | 0 | 9818 | 26802 | 15462 |
> 19190
> 265 | 4 | 1 | 20 | 0 | 0 | 0 |
> 17 | 0 | 0 | 9818 | 22938 | 15482 |
> 19565
> Traceback (most recent call last):
> File "tests/stress/concurrent_select.py", line 2320, in <module>
> main()
> File "tests/stress/concurrent_select.py", line 2316, in main
> select_probability=args.select_probability)
> File "tests/stress/concurrent_select.py", line 529, in run_queries
> self._wait_for_test_to_finish(impala, should_print_status)
> File "tests/stress/concurrent_select.py", line 943, in
> _wait_for_test_to_finish
> self._print_status(print_header=(lines_printed == 0))
> File "tests/stress/concurrent_select.py", line 837, in _print_status
> self._num_queries_executing,
> File "tests/stress/concurrent_select.py", line 690, in
> _num_queries_executing
> assert num_executing >= 0, "The number of executing queries is negative"
> AssertionError: The number of executing queries is negative
> {noformat}
> I've run many thousand-query stress test runs against 3.1 without hitting
> this. Where this finally happened was against a 2.12-based build, if that
> matters.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]