skonto edited a comment on issue #24613: [SPARK-27549][SS] Add support for 
committing kafka offsets per batch for supporting external tooling
URL: https://github.com/apache/spark/pull/24613#issuecomment-493974211
 
 
   > Why not use one group and listConsumerGroupOffsets?
   
   @gaborgsomogyi yes, `listConsumerGroupOffsets` could be used, but is it 
implemented in most clients? See 
[here](https://github.com/edenhill/librdkafka/issues/2173). If people use 
the admin client, it makes sense. It seems to make two 
[calls](https://github.com/apache/kafka/blob/3b1524c5dfd2a94f3fb919dad0de70984963772b/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L2776):
 first it finds the coordinator, and then it lists the offsets.
   
   On the other hand, when I say filtering, I don't mean filtering the whole 
topic. It would mean picking up from the latest offset and, as new records come 
into that topic, processing them or not based on the filter. Of course, that 
could also be slow. Anyway, I don't have a clear view of the performance at the 
moment, but I don't mind switching to supporting that special call if possible. 
In our open-source tool [here](https://github.com/lightbend/kafka-lag-exporter/) 
we use `listConsumerGroupOffsets` anyway.
   
   When you say one group, what do you mean? If I create the same groupId per 
source per query, then partial data may be assigned, as described 
[here](https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81).
   So it does not make sense for multiple queries running in parallel.
   If the same query is restarted and reusing the same groupId does not create 
issues, then I could do it if I checkpoint that info, e.g. enforce a unique 
groupId equal to the query ID that is persisted across restarts in the metadata 
dir. @gaborgsomogyi do you mean that?
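
A minimal sketch of that last idea, assuming a groupId derived from an ID that is written once to the metadata dir and read back on restart. It uses only the Java standard library and does not touch Spark or Kafka; the class name `StableGroupId`, the file name `query-id`, and the `spark-kafka-source-` prefix are hypothetical choices for illustration, not what the PR implements.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class StableGroupId {
    // Hypothetical file name inside the metadata dir.
    static final String ID_FILE = "query-id";
    // Hypothetical prefix for the derived consumer group id.
    static final String PREFIX = "spark-kafka-source-";

    // Returns the persisted query id, creating and storing one on first use.
    static String loadOrCreateQueryId(Path metadataDir) {
        try {
            Files.createDirectories(metadataDir);
            Path idFile = metadataDir.resolve(ID_FILE);
            if (Files.exists(idFile)) {
                // Restart case: reuse the id written by the earlier run.
                byte[] bytes = Files.readAllBytes(idFile);
                return new String(bytes, StandardCharsets.UTF_8).trim();
            }
            // First run: generate an id and persist it.
            String id = UUID.randomUUID().toString();
            Files.write(idFile, id.getBytes(StandardCharsets.UTF_8));
            return id;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Derives the group id from the persisted query id, so it is unique per
    // metadata dir and unchanged across restarts of the same query.
    static String groupIdFor(Path metadataDir) {
        return PREFIX + loadOrCreateQueryId(metadataDir);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-metadata");
        String firstRun = groupIdFor(dir);
        String afterRestart = groupIdFor(dir);
        System.out.println("same group id after restart: "
                + firstRun.equals(afterRestart));
    }
}
```

Two queries with different metadata dirs get different group ids under this scheme, which avoids the partial-assignment problem for parallel queries, while a restarted query keeps committing offsets under the same group.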

