Kenneth Knowles created BEAM-6998:
-------------------------------------

             Summary: LTS backport: Flink Runner gets progressively stuck when 
Pubsub subscription is nearly empty
                 Key: BEAM-6998
                 URL: https://issues.apache.org/jira/browse/BEAM-6998
             Project: Beam
          Issue Type: Bug
          Components: io-java-gcp, runner-flink
    Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0
            Reporter: Encho Mishinev
            Assignee: Maximilian Michels
             Fix For: 2.10.0


I am running the Flink runner on Apache Beam 2.6.0.

My pipeline involves reading from Google Cloud Pubsub. The problem is that 
whenever there are few messages left in the subscription I'm reading from, the 
whole job becomes progressively slower and slower, Flink's checkpoints start 
taking much more time and messages seem to not get properly acknowledged.

This happens only whenever the subscription is nearly empty. For example when 
running 13 taskmanagers with parallelism of 52 for the job and a subscription 
that has 122 000 000 messages, you start feeling the slowing down after there 
are only 1 000 000 - 2 000 000 messages left.

In one of my tests the job processed nearly 122 000 000 messages in an hour and 
then spent over 30 minutes attempting to do the few hundred thousand left. In 
the end it was reading a few hundred messages a minute and not reading at all 
for some periods. Upon stopping it the subscription still had 235 
unacknowledged messages, even though Flink's element count was higher than the 
amount of messages I had loaded. The only explanation is that the messages did 
not get properly acknowledged and were resent.

I have set up the subscriptions to a large acknowledgment deadline, but that 
does not help.

I did smaller tests on subscriptions with 100 000 messages and a job that 
simply reads and does nothing else. The problem is still evident. With 
parallelism of 52 the job gets slow right away. Takes over 5min to read about 
100 000 messages and a few hundred seem to keep cycling through never being 
acknowledged.

On the other hand a parallelism of 1 works fine until there are about 5000 
messages left, and then slows down similarly.

Parallelism of 16 reads about 75 000 of the 100 000 immediately (a few seconds) 
and then proceeds to slowly work on the other 25 000 for minutes.

The PubsubIO connector is provided by Beam so I suspect the problem to be in 
Beam's Flink runner rather than Flink itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to