[
https://issues.apache.org/jira/browse/BEAM-5386?focusedWorklogId=179825&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-179825
]
ASF GitHub Bot logged work on BEAM-5386:
----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Dec/18 19:23
Start Date: 30/Dec/18 19:23
Worklog Time Spent: 10m
Work Description: mxm commented on pull request #7349: [BEAM-5386]
Prevent CheckpointMarks from not getting acknowledged
URL: https://github.com/apache/beam/pull/7349#discussion_r244543221
##########
File path:
runners/flink/src/test/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapperTest.java
##########
@@ -626,13 +650,29 @@ public void testSourceWithNoReaderDoesNotShutdown()
       throws Exception {
     try {
       thread.start();
-      List<UnboundedSource.UnboundedReader<KV<Integer, Integer>>> localReaders =
-          sourceWrapper.getLocalReaders();
-      while (localReaders != null && !localReaders.isEmpty()) {
-        Thread.sleep(200);
-        // should stay alive
-        assertThat(thread.isAlive(), is(true));
+      // Wait to see if the wrapper shuts down immediately in case it doesn't have readers
+      if (!shouldHaveReaders) {
+        // The expected state is for finalizeSource to sleep instead of exiting
+        while (true) {
+          StackTraceElement[] callStack = thread.getStackTrace();
+          if (callStack.length >= 2
+              && "sleep".equals(callStack[0].getMethodName())
+              && "finalizeSource".equals(callStack[1].getMethodName())) {
+            break;
+          }
+          Thread.sleep(10);
+        }
+      }
+      // Source should still be running even if there are no readers
+      assertThat(sourceWrapper.isRunning(), is(true));
+      synchronized (checkpointLock) {
+        // Trigger emission of the watermark by updating processing time.
+        // The actual processing time value does not matter.
+        sourceWrapper.onProcessingTime(42);
       }
+      // Source should still be running even when watermark is at max
+      assertThat(sourceWrapper.isRunning(), is(true));
+      assertThat(thread.isAlive(), is(true));
       sourceWrapper.cancel();
     } finally {
       thread.interrupt();
Review comment:
Yes, that's a good addition.
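The diff above replaces a fixed-duration "sleep and assert alive" loop with a deterministic wait: it polls the source thread's stack trace until the thread is observed sleeping inside finalizeSource. The following is a minimal, standalone sketch of that polling technique; the class, method, and thread names are hypothetical stand-ins, not the Beam code, and the check is written slightly more loosely than the diff's exact callStack[0]/callStack[1] match because the frames inside Thread.sleep() vary across JDK versions.

```java
public class StackTracePollingSketch {

  // Hypothetical stand-in for the wrapper's finalizeSource(): parks in a
  // sleep loop instead of exiting (this is NOT the Beam implementation).
  static void finalizeSource() {
    try {
      while (true) {
        Thread.sleep(50);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore interrupt status and exit
    }
  }

  // Polls until the thread is TIMED_WAITING (e.g. in Thread.sleep) with
  // methodName somewhere on its stack.
  static void awaitSleepingIn(Thread thread, String methodName)
      throws InterruptedException {
    while (true) {
      if (thread.getState() == Thread.State.TIMED_WAITING) {
        for (StackTraceElement frame : thread.getStackTrace()) {
          if (methodName.equals(frame.getMethodName())) {
            return; // thread is provably idle inside methodName
          }
        }
      }
      Thread.sleep(10);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread worker = new Thread(StackTracePollingSketch::finalizeSource, "finalize-worker");
    worker.start();
    awaitSleepingIn(worker, "finalizeSource");
    // The worker is still alive, parked in its sleep loop.
    System.out.println("sleeping in finalizeSource: " + worker.isAlive());
    worker.interrupt();
    worker.join();
  }
}
```

Compared with sleeping for a fixed 200 ms and asserting the thread is alive, this makes the test both faster in the common case and immune to the race where the thread has not yet reached its idle state when the assertion runs.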
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 179825)
Time Spent: 5h 20m (was: 5h 10m)
> Flink Runner gets progressively stuck when Pubsub subscription is nearly empty
> ------------------------------------------------------------------------------
>
> Key: BEAM-5386
> URL: https://issues.apache.org/jira/browse/BEAM-5386
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp, runner-flink
> Affects Versions: 2.6.0
> Reporter: Encho Mishinev
> Assignee: Chamikara Jayalath
> Priority: Major
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> I am running the Flink runner on Apache Beam 2.6.0.
> My pipeline involves reading from Google Cloud Pubsub. The problem is that
> whenever there are few messages left in the subscription I'm reading from,
> the whole job becomes progressively slower and slower, Flink's checkpoints
> start taking much more time and messages seem to not get properly
> acknowledged.
> This happens only when the subscription is nearly empty. For example, when
> running 13 taskmanagers with parallelism of 52 for the job and a subscription
> that has 122 000 000 messages, you start feeling the slowing down after there
> are only 1 000 000 - 2 000 000 messages left.
> In one of my tests the job processed nearly 122 000 000 messages in an hour
> and then spent over 30 minutes attempting to do the few hundred thousand
> left. In the end it was reading a few hundred messages a minute and not
> reading at all for some periods. Upon stopping it the subscription still had
> 235 unacknowledged messages, even though Flink's element count was higher
> than the number of messages I had loaded. The only explanation is that the
> messages did not get properly acknowledged and were resent.
> I have set up the subscriptions to a large acknowledgment deadline, but that
> does not help.
> I did smaller tests on subscriptions with 100 000 messages and a job that
> simply reads and does nothing else. The problem is still evident. With a
> parallelism of 52 the job gets slow right away. It takes over 5 minutes to read
> about 100 000 messages, and a few hundred seem to keep cycling through never being
> acknowledged.
> On the other hand, a parallelism of 1 works fine until there are about 5000
> messages left, and then slows down similarly.
> Parallelism of 16 reads about 75 000 of the 100 000 immediately (a few
> seconds) and then proceeds to slowly work on the other 25 000 for minutes.
> The PubsubIO connector is provided by Beam, so I suspect the problem to be in
> Beam's Flink runner rather than Flink itself.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)