[
https://issues.apache.org/jira/browse/BEAM-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-5386:
----------------------------------
Fix Version/s: (was: 2.7.1)
> Flink Runner gets progressively stuck when Pubsub subscription is nearly empty
> ------------------------------------------------------------------------------
>
> Key: BEAM-5386
> URL: https://issues.apache.org/jira/browse/BEAM-5386
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp, runner-flink
> Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0
> Reporter: Encho Mishinev
> Assignee: Maximilian Michels
> Priority: Major
> Fix For: 2.10.0
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> I am running the Flink runner on Apache Beam 2.6.0.
> My pipeline involves reading from Google Cloud Pubsub. The problem is that
> whenever there are few messages left in the subscription I'm reading from,
> the whole job becomes progressively slower and slower, Flink's checkpoints
> start taking much more time and messages seem to not get properly
> acknowledged.
> This happens only whenever the subscription is nearly empty. For example when
> running 13 taskmanagers with parallelism of 52 for the job and a subscription
> that has 122 000 000 messages, you start feeling the slowing down after there
> are only 1 000 000 - 2 000 000 messages left.
> In one of my tests the job processed nearly 122 000 000 messages in an hour
> and then spent over 30 minutes attempting to do the few hundred thousand
> left. In the end it was reading a few hundred messages a minute and not
> reading at all for some periods. Upon stopping it the subscription still had
> 235 unacknowledged messages, even though Flink's element count was higher
> than the amount of messages I had loaded. The only explanation is that the
> messages did not get properly acknowledged and were resent.
> I have set up the subscriptions to a large acknowledgment deadline, but that
> does not help.
> I did smaller tests on subscriptions with 100 000 messages and a job that
> simply reads and does nothing else. The problem is still evident. With
> parallelism of 52 the job gets slow right away. Takes over 5min to read about
> 100 000 messages and a few hundred seem to keep cycling through never being
> acknowledged.
> On the other hand a parallelism of 1 works fine until there are about 5000
> messages left, and then slows down similarly.
> Parallelism of 16 reads about 75 000 of the 100 000 immediately (a few
> seconds) and then proceeds to slowly work on the other 25 000 for minutes.
> The PubsubIO connector is provided by Beam so I suspect the problem to be in
> Beam's Flink runner rather than Flink itself.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)