[jira] [Updated] (BEAM-5386) Flink Runner gets progressively stuck when Pubsub subscription is nearly empty

Kenneth Knowles (JIRA) Wed, 03 Apr 2019 09:52:48 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kenneth Knowles updated BEAM-5386:
----------------------------------
    Fix Version/s:     (was: 2.7.1)

> Flink Runner gets progressively stuck when Pubsub subscription is nearly empty
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-5386
>                 URL: https://issues.apache.org/jira/browse/BEAM-5386
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp, runner-flink
>    Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0
>            Reporter: Encho Mishinev
>            Assignee: Maximilian Michels
>            Priority: Major
>             Fix For: 2.10.0
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> I am running the Flink runner on Apache Beam 2.6.0.
> My pipeline involves reading from Google Cloud Pubsub. The problem is that 
> whenever there are few messages left in the subscription I'm reading from, 
> the whole job becomes progressively slower and slower, Flink's checkpoints 
> start taking much more time and messages seem to not get properly 
> acknowledged.
> This happens only whenever the subscription is nearly empty. For example when 
> running 13 taskmanagers with parallelism of 52 for the job and a subscription 
> that has 122 000 000 messages, you start feeling the slowing down after there 
> are only 1 000 000 - 2 000 000 messages left.
> In one of my tests the job processed nearly 122 000 000 messages in an hour 
> and then spent over 30 minutes attempting to do the few hundred thousand 
> left. In the end it was reading a few hundred messages a minute and not 
> reading at all for some periods. Upon stopping it the subscription still had 
> 235 unacknowledged messages, even though Flink's element count was higher 
> than the amount of messages I had loaded. The only explanation is that the 
> messages did not get properly acknowledged and were resent.
> I have set up the subscriptions to a large acknowledgment deadline, but that 
> does not help.
> I did smaller tests on subscriptions with 100 000 messages and a job that 
> simply reads and does nothing else. The problem is still evident. With 
> parallelism of 52 the job gets slow right away. Takes over 5min to read about 
> 100 000 messages and a few hundred seem to keep cycling through never being 
> acknowledged.
> On the other hand a parallelism of 1 works fine until there are about 5000 
> messages left, and then slows down similarly.
> Parallelism of 16 reads about 75 000 of the 100 000 immediately (a few 
> seconds) and then proceeds to slowly work on the other 25 000 for minutes.
> The PubsubIO connector is provided by Beam so I suspect the problem to be in 
> Beam's Flink runner rather than Flink itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (BEAM-5386) Flink Runner gets progressively stuck when Pubsub subscription is nearly empty

Reply via email to