[jira] [Comment Edited] (IMPALA-8007) test_slow_subscriber is flaky

bharath v (JIRA) Wed, 19 Dec 2018 16:47:17 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725417#comment-16725417
 ]


bharath v edited comment on IMPALA-8007 at 12/20/18 12:46 AM:
--------------------------------------------------------------

I think one of the problems here is that {{secs_since_heartbeat}} is computed 
from the time the request actually runs on the Impala webserver.

{noformat}
  Value time_since_heartbeat(
        StringPrintf(
            "%.3f", ((MonotonicMicros() - 
subscriber.second->last_heartbeat_timestamp())
                        / 1000000.0)).c_str(), document->GetAllocator());
{noformat}

If there is a delay (say context switching or CPU load etc), the value could be 
skewed and the assert doesn't hold. I could repro the first assert failure by 
adding a simple sleep like follows.

{noformat}
diff --git a/tests/statestore/test_statestore.py 
b/tests/statestore/test_statestore.py
index f9e61cf..3493f0c 100644
--- a/tests/statestore/test_statestore.py
+++ b/tests/statestore/test_statestore.py
@@ -60,6 +60,7 @@ LOG = logging.getLogger('test_statestore')
 #    Test that topic deletions take effect correctly.
 
 def get_statestore_subscribers(host='localhost', port=25010):
+  time.sleep(5)
   response = urllib2.urlopen("http://{0}:{1}/subscribers?json".format(host, 
port))
   page = response.read()
   return json.loads(page)
@@ -562,14 +563,17 @@ class TestStatestore():
     acceptable value. Since the statestore heartbeats at 1 second intervals, 
an acceptable
     value would be between ((sleep_time-1.0), (sleep_time+1.0))."""
     sub = StatestoreSubscriber()
-    sub.start().register().wait_for_heartbeat(1)
-    sub.kill()
-    sleep_time = randint(1, 9)
+    logging.info("Registering sub and waiting for 1 hb: " + sub.subscriber_id)
+    sub.start().register().wait_for_heartbeat(1).kill()
+    logging.info("Killing sub: " + sub.subscriber_id);
+    sleep_time = 6
     time.sleep(sleep_time)
+    logging.info("slept for 6s")
     subscribers = get_statestore_subscribers()["subscribers"]
     for s in subscribers:
       if str(s["id"]) == sub.subscriber_id:
         secs_since_heartbeat = float(s["secs_since_heartbeat"])
+        logging.info("secs_since_heartbeat: " + str(secs_since_heartbeat))
         assert (secs_since_heartbeat > float(sleep_time - 1.0))
         assert (secs_since_heartbeat < float(sleep_time + 1.0))
{noformat}



was (Author: bharathv):
I think one of the problems here is that {{secs_since_heartbeat}} is computed 
from the time the request is made to the webserver.

{noformat}
  Value time_since_heartbeat(
        StringPrintf(
            "%.3f", ((MonotonicMicros() - 
subscriber.second->last_heartbeat_timestamp())
                        / 1000000.0)).c_str(), document->GetAllocator());
{noformat}

If there is a delay (say context switching or CPU load etc), the value could be 
skewed and the assert doesn't hold. I could repro the first assert failure by 
adding a simple sleep like follows.

{noformat}
diff --git a/tests/statestore/test_statestore.py 
b/tests/statestore/test_statestore.py
index f9e61cf..3493f0c 100644
--- a/tests/statestore/test_statestore.py
+++ b/tests/statestore/test_statestore.py
@@ -60,6 +60,7 @@ LOG = logging.getLogger('test_statestore')
 #    Test that topic deletions take effect correctly.
 
 def get_statestore_subscribers(host='localhost', port=25010):
+  time.sleep(5)
   response = urllib2.urlopen("http://{0}:{1}/subscribers?json".format(host, 
port))
   page = response.read()
   return json.loads(page)
@@ -562,14 +563,17 @@ class TestStatestore():
     acceptable value. Since the statestore heartbeats at 1 second intervals, 
an acceptable
     value would be between ((sleep_time-1.0), (sleep_time+1.0))."""
     sub = StatestoreSubscriber()
-    sub.start().register().wait_for_heartbeat(1)
-    sub.kill()
-    sleep_time = randint(1, 9)
+    logging.info("Registering sub and waiting for 1 hb: " + sub.subscriber_id)
+    sub.start().register().wait_for_heartbeat(1).kill()
+    logging.info("Killing sub: " + sub.subscriber_id);
+    sleep_time = 6
     time.sleep(sleep_time)
+    logging.info("slept for 6s")
     subscribers = get_statestore_subscribers()["subscribers"]
     for s in subscribers:
       if str(s["id"]) == sub.subscriber_id:
         secs_since_heartbeat = float(s["secs_since_heartbeat"])
+        logging.info("secs_since_heartbeat: " + str(secs_since_heartbeat))
         assert (secs_since_heartbeat > float(sleep_time - 1.0))
         assert (secs_since_heartbeat < float(sleep_time + 1.0))
{noformat}


> test_slow_subscriber is flaky
> -----------------------------
>
>                 Key: IMPALA-8007
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8007
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.2.0
>            Reporter: bharath v
>            Assignee: Pooja Nilangekar
>            Priority: Major
>              Labels: broken-build
>             Fix For: Impala 3.2.0
>
>
> We have hit both the asserts in the test.
> *Exhaustive:*
> {noformat}
> statestore/test_statestore.py:574: in test_slow_subscriber     assert 
> (secs_since_heartbeat < float(sleep_time + 1.0)) E   assert 
> 8.8040000000000003 < 6.0 E    +  where 6.0 = float((5 + 1.0))
> Stacktrace
> statestore/test_statestore.py:574: in test_slow_subscriber
>     assert (secs_since_heartbeat < float(sleep_time + 1.0))
> E   assert 8.8040000000000003 < 6.0
> E    +  where 6.0 = float((5 + 1.0))
> {noformat}
> *ASAN*
> {noformat}
> Error Message
> statestore/test_statestore.py:573: in t     assert (secs_since_heartbeat > 
> float(sleep_time - 1.0)) E   assert 4.995 > 5.0 E    +  where 5.0 = float((6 
> - 1.0))
> Stacktrace
> statestore/test_statestore.py:573: in test_slow_subscriber
>     assert (secs_since_heartbeat > float(sleep_time - 1.0))
> E   assert 4.995 > 5.0
> E    +  where 5.0 = float((6 - 1.0))
> {noformat}
> I only noticed this happen twice (the above two instances) since the patch is 
> committed. So, looks like a racy bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Comment Edited] (IMPALA-8007) test_slow_subscriber is flaky

Reply via email to