kennknowles opened a new issue, #19017:
URL: https://github.com/apache/beam/issues/19017

   We are running a Beam stream-processing job on the Spark runner that reads 
from a Kafka topic using Kerberos authentication. We use java-io-kafka v2.4.0 
to read from the Kafka topic in the pipeline. The issue is that the KafkaIO 
client continuously creates a new Kafka consumer with the specified config, 
performing a Kerberos login every time. In addition, Spark streaming jobs are 
spawned for the unbounded source roughly every second, even when there is no 
data in the Kafka topic. The log shows these jobs:
   
   INFO SparkContext: Starting job: [email protected]:172
   
   We can also see in the logs:
   
   INFO MicrobatchSource: No cached reader found for split: 
[org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
reader at checkpoint mark...
   
   It then creates a new consumer that performs a fresh Kerberos login, which 
is causing problems.
   
   We are unsure what the correct behavior should be here, and why so many 
Spark streaming jobs are being created. We tried the same Beam code on the 
Flink runner and did not see this issue there. Can someone point us to the 
correct settings for using an unbounded Kafka source with the Spark runner in 
Beam?
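   For reference, a minimal sketch of how such a pipeline is typically wired 
up. All names and values here (broker address, topic, service name) are 
illustrative, not taken from the reporter's job; `updateConsumerProperties` is 
the consumer-config hook in KafkaIO as of Beam 2.4.0, and the JAAS file 
supplying the Kerberos keytab is passed to the JVM separately:

```java
import java.util.Map;
import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaKerberosPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Standard Kafka consumer properties for SASL/GSSAPI (Kerberos).
    // The JAAS login module is supplied via
    //   -Djava.security.auth.login.config=/path/to/jaas.conf
    // on each executor; values below are illustrative.
    Map<String, Object> kerberosProps = ImmutableMap.of(
        "security.protocol", "SASL_PLAINTEXT",
        "sasl.kerberos.service.name", "kafka");

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker1:9092")            // illustrative
        .withTopic("my-topic")                           // illustrative
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(kerberosProps));

    p.run().waitUntilFinish();
  }
}
```

   With this setup, each time the Spark runner fails to find a cached reader 
for a split it constructs a new KafkaConsumer from these properties, which is 
what triggers the repeated Kerberos logins described above.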
   
   Imported from Jira 
[BEAM-4803](https://issues.apache.org/jira/browse/BEAM-4803). Original Jira may 
contain additional context.
   Reported by: vivek_17.

