[ https://issues.apache.org/jira/browse/IMPALA-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725417#comment-16725417 ]
bharath v edited comment on IMPALA-8007 at 12/20/18 12:46 AM: -------------------------------------------------------------- I think one of the problems here is that {{secs_since_heartbeat}} is computed from the time the request actually runs on the Impala webserver. {noformat} Value time_since_heartbeat( StringPrintf( "%.3f", ((MonotonicMicros() - subscriber.second->last_heartbeat_timestamp()) / 1000000.0)).c_str(), document->GetAllocator()); {noformat} If there is a delay (say context switching or CPU load etc), the value could be skewed and the assert doesn't hold. I could repro the first assert failure by adding a simple sleep like follows. {noformat} diff --git a/tests/statestore/test_statestore.py b/tests/statestore/test_statestore.py index f9e61cf..3493f0c 100644 --- a/tests/statestore/test_statestore.py +++ b/tests/statestore/test_statestore.py @@ -60,6 +60,7 @@ LOG = logging.getLogger('test_statestore') # Test that topic deletions take effect correctly. def get_statestore_subscribers(host='localhost', port=25010): + time.sleep(5) response = urllib2.urlopen("http://{0}:{1}/subscribers?json".format(host, port)) page = response.read() return json.loads(page) @@ -562,14 +563,17 @@ class TestStatestore(): acceptable value. Since the statestore heartbeats at 1 second intervals, an acceptable value would be between ((sleep_time-1.0), (sleep_time+1.0)).""" sub = StatestoreSubscriber() - sub.start().register().wait_for_heartbeat(1) - sub.kill() - sleep_time = randint(1, 9) + logging.info("Registering sub and waiting for 1 hb: " + sub.subscriber_id) + sub.start().register().wait_for_heartbeat(1).kill() + logging.info("Killing sub: " + sub.subscriber_id); + sleep_time = 6 time.sleep(sleep_time) + logging.info("slept for 6s") subscribers = get_statestore_subscribers()["subscribers"] for s in subscribers: if str(s["id"]) == sub.subscriber_id: secs_since_heartbeat = float(s["secs_since_heartbeat"]) + logging.info("secs_since_heartbeat: " + str(secs_since_heartbeat)) assert (secs_since_heartbeat > float(sleep_time - 1.0)) assert (secs_since_heartbeat < float(sleep_time + 1.0)) {noformat} was (Author: bharathv): I think one of the problems here is that {{secs_since_heartbeat}} is computed from the time the request is made to the webserver. {noformat} Value time_since_heartbeat( StringPrintf( "%.3f", ((MonotonicMicros() - subscriber.second->last_heartbeat_timestamp()) / 1000000.0)).c_str(), document->GetAllocator()); {noformat} If there is a delay (say context switching or CPU load etc), the value could be skewed and the assert doesn't hold. I could repro the first assert failure by adding a simple sleep like follows. {noformat} diff --git a/tests/statestore/test_statestore.py b/tests/statestore/test_statestore.py index f9e61cf..3493f0c 100644 --- a/tests/statestore/test_statestore.py +++ b/tests/statestore/test_statestore.py @@ -60,6 +60,7 @@ LOG = logging.getLogger('test_statestore') # Test that topic deletions take effect correctly. def get_statestore_subscribers(host='localhost', port=25010): + time.sleep(5) response = urllib2.urlopen("http://{0}:{1}/subscribers?json".format(host, port)) page = response.read() return json.loads(page) @@ -562,14 +563,17 @@ class TestStatestore(): acceptable value. Since the statestore heartbeats at 1 second intervals, an acceptable value would be between ((sleep_time-1.0), (sleep_time+1.0)).""" sub = StatestoreSubscriber() - sub.start().register().wait_for_heartbeat(1) - sub.kill() - sleep_time = randint(1, 9) + logging.info("Registering sub and waiting for 1 hb: " + sub.subscriber_id) + sub.start().register().wait_for_heartbeat(1).kill() + logging.info("Killing sub: " + sub.subscriber_id); + sleep_time = 6 time.sleep(sleep_time) + logging.info("slept for 6s") subscribers = get_statestore_subscribers()["subscribers"] for s in subscribers: if str(s["id"]) == sub.subscriber_id: secs_since_heartbeat = float(s["secs_since_heartbeat"]) + logging.info("secs_since_heartbeat: " + str(secs_since_heartbeat)) assert (secs_since_heartbeat > float(sleep_time - 1.0)) assert (secs_since_heartbeat < float(sleep_time + 1.0)) {noformat} > test_slow_subscriber is flaky > ----------------------------- > > Key: IMPALA-8007 > URL: https://issues.apache.org/jira/browse/IMPALA-8007 > Project: IMPALA > Issue Type: Bug > Components: Backend > Affects Versions: Impala 3.2.0 > Reporter: bharath v > Assignee: Pooja Nilangekar > Priority: Major > Labels: broken-build > Fix For: Impala 3.2.0 > > > We have hit both the asserts in the test. > *Exhaustive:* > {noformat} > statestore/test_statestore.py:574: in test_slow_subscriber assert > (secs_since_heartbeat < float(sleep_time + 1.0)) E assert > 8.8040000000000003 < 6.0 E + where 6.0 = float((5 + 1.0)) > Stacktrace > statestore/test_statestore.py:574: in test_slow_subscriber > assert (secs_since_heartbeat < float(sleep_time + 1.0)) > E assert 8.8040000000000003 < 6.0 > E + where 6.0 = float((5 + 1.0)) > {noformat} > *ASAN* > {noformat} > Error Message > statestore/test_statestore.py:573: in t assert (secs_since_heartbeat > > float(sleep_time - 1.0)) E assert 4.995 > 5.0 E + where 5.0 = float((6 > - 1.0)) > Stacktrace > statestore/test_statestore.py:573: in test_slow_subscriber > assert (secs_since_heartbeat > float(sleep_time - 1.0)) > E assert 4.995 > 5.0 > E + where 5.0 = float((6 - 1.0)) > {noformat} > I only noticed this happen twice (the above two instances) since the patch is > committed. So, looks like a racy bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org