yanspirit opened a new pull request, #91:
URL: https://github.com/apache/flink-connector-kafka/pull/91

   When a partition's leader is invalid (leader=-1), a Flink streaming job using 
KafkaSource cannot restart, nor can a new instance start with a new group ID. 
It gets stuck and fails with the following exception:
   
   "org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired 
before the position for partition aaa-1 could be determined"
   
   When leader=-1, Kafka APIs such as KafkaConsumer.position() block until 
either the position can be determined or an unrecoverable error is 
encountered, or, as in the exception above, until the timeout expires.
   
   In fact, leader=-1 is not easy to avoid. Even with a replication factor of 
3, three disks going offline together will trigger the problem, especially when 
the cluster is relatively large. Recovery relies on the Kafka administrator 
fixing it in time, which is risky during the cluster's peak period.
   
   This can be addressed by filtering out partitions that have an invalid 
leader, combined with the partition discovery interval.
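
   As an illustration of the idea only, and not the code in this PR, the 
filter can be sketched as follows. PartitionMeta, filterValidLeaders and 
discoveryProps are hypothetical names invented for this sketch; the only real 
identifier assumed is the KafkaSource option key 
"partition.discovery.interval.ms". The assumption is that a partition skipped 
in one discovery pass is looked at again in a later pass, once a leader has 
been elected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class InvalidLeaderFilterSketch {

    // Leader id reported for a partition that currently has no leader.
    static final int NO_LEADER = -1;

    // Hypothetical stand-in for partition metadata; not a Kafka or Flink class.
    static final class PartitionMeta {
        final String topic;
        final int partition;
        final int leaderId;

        PartitionMeta(String topic, int partition, int leaderId) {
            this.topic = topic;
            this.partition = partition;
            this.leaderId = leaderId;
        }

        @Override
        public String toString() {
            return topic + "-" + partition;
        }
    }

    // Keep only the partitions whose leader is currently valid, so that no
    // position lookup is attempted on a leaderless partition.
    static List<PartitionMeta> filterValidLeaders(List<PartitionMeta> discovered) {
        List<PartitionMeta> valid = new ArrayList<>();
        for (PartitionMeta p : discovered) {
            if (p.leaderId != NO_LEADER) {
                valid.add(p);
            }
        }
        return valid;
    }

    // Build source properties that enable periodic partition discovery.
    // The key is assumed to be the KafkaSource discovery option.
    static Properties discoveryProps(long intervalMs) {
        Properties props = new Properties();
        props.setProperty("partition.discovery.interval.ms", Long.toString(intervalMs));
        return props;
    }

    public static void main(String[] args) {
        List<PartitionMeta> discovered = List.of(
                new PartitionMeta("aaa", 0, 2),
                new PartitionMeta("aaa", 1, NO_LEADER));
        System.out.println("assignable now: " + filterValidLeaders(discovered));
        System.out.println("source properties: " + discoveryProps(30_000L));
    }
}
```

   In this sketch the leaderless partition aaa-1 is simply left out of the 
current pass instead of blocking startup; the 30-second interval is an 
arbitrary example value.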


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to