c0de0ff opened a new issue #7776: kafka indexing service should always use 
earliest offset for newly discovered topic-partitions instead of 
useEarliestOffset config
URL: https://github.com/apache/incubator-druid/issues/7776
 
 
   ### Description
   
   The Kafka indexing service currently falls back to the `useEarliestOffset` 
config whenever it can't find a previously stored offset for a topic-partition. 
This happens when the supervisor runs for the first time or when a new 
partition is added to the topic in Kafka. The config is also used when offsets 
are reset (if `resetOffsetAutomatically` is set to true).
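   
   As a rough illustration of the behavior described above, here is a minimal 
sketch in Python. It is a simplified model, not Druid's actual code: the 
function name `starting_position` and the `stored_offsets` dict are 
hypothetical stand-ins for whatever the supervisor really keeps.

```python
from typing import Dict, Union

EARLIEST = "earliest"
LATEST = "latest"


def starting_position(
    stored_offsets: Dict[int, int],
    partition: int,
    use_earliest_offset: bool,
) -> Union[int, str]:
    """Simplified model of the current behavior (hypothetical helper).

    If an offset is already stored for the partition, resume from it.
    Otherwise fall back to the useEarliestOffset config, regardless of
    why the offset is missing (first run, new partition, or reset).
    """
    if partition in stored_offsets:
        return stored_offsets[partition]
    return EARLIEST if use_earliest_offset else LATEST


# Partition 0 has a stored offset; partition 1 was just discovered.
stored = {0: 42}
print(starting_position(stored, 0, use_earliest_offset=False))
print(starting_position(stored, 1, use_earliest_offset=False))
```

   In this model the newly discovered partition 1 starts from the latest 
offset whenever the config is false, which is the case this issue is about.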
   
   In two of the three scenarios above (first run and reset), it makes sense to 
use the `useEarliestOffset` config. However, the indexing service should not 
use this config for newly discovered partitions: if `useEarliestOffset` is set 
to false, any records written to a new partition before the supervisor 
discovers it are skipped, which is data loss. In a production environment with 
large Kafka clusters and many long-running supervisors, adding new partitions 
to Kafka topics is a common occurrence, so this config must always remain true 
to avoid data loss.
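   
   To make the data-loss case concrete, here is a small worked example in 
Python. The numbers and the helper `records_skipped` are made up for 
illustration; it only assumes that "latest" means the end of the partition's 
log at the moment the supervisor first sees the partition.

```python
def records_skipped(
    log_end_offset: int,
    use_earliest_offset: bool,
    log_start_offset: int = 0,
) -> int:
    """Count records in a newly discovered partition that are never read.

    Hypothetical helper: starting from "earliest" reads everything from
    log_start_offset, starting from "latest" begins at log_end_offset and
    skips whatever was written before discovery.
    """
    start = log_start_offset if use_earliest_offset else log_end_offset
    return start - log_start_offset


# Suppose 500 records landed in the new partition before it was discovered.
print(records_skipped(log_end_offset=500, use_earliest_offset=False))
print(records_skipped(log_end_offset=500, use_earliest_offset=True))
```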
   
   A typical use case in large Kafka clusters is to start a new supervisor 
from the latest offset and then keep consuming without any data loss (exactly 
once). To achieve this today, we have to start the supervisor with 
`useEarliestOffset` set to false, wait for it to start running, and then set 
the config back to true to avoid data loss in new partitions. A user may also 
want to reset to the latest offsets manually using the reset API; in that case 
too, they need to remember to set the config back to true afterwards, which is 
error prone.
   
   One solution might be to ignore the config when getting offsets for new 
partitions (always use earliest). However, I am not sure how we can 
differentiate the two events "new partitions added" and "supervisor first run".
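   
   One possible way to tell the two events apart, sketched in Python below, is 
to treat a partition as "newly discovered" when offsets are already stored for 
other partitions of the same topic. This heuristic is an assumption made for 
illustration only, not something confirmed against Druid's implementation, and 
`starting_position_proposed` and `stored_offsets` are hypothetical names.

```python
from typing import Dict, Union

EARLIEST = "earliest"
LATEST = "latest"


def starting_position_proposed(
    stored_offsets: Dict[int, int],
    partition: int,
    use_earliest_offset: bool,
) -> Union[int, str]:
    """Sketch of the proposed behavior under an assumed heuristic.

    - Stored offset exists: resume from it.
    - No stored offset, but other partitions have one: assume the
      partition is newly discovered and always start from earliest.
    - No stored offsets at all: assume first run (or a full reset) and
      honor the useEarliestOffset config.
    """
    if partition in stored_offsets:
        return stored_offsets[partition]
    if stored_offsets:
        return EARLIEST
    return EARLIEST if use_earliest_offset else LATEST


# First run with the config set to false: start from latest.
print(starting_position_proposed({}, 0, use_earliest_offset=False))
# Later, partition 1 appears: start from earliest despite the config.
print(starting_position_proposed({0: 42}, 1, use_earliest_offset=False))
```

   Under this sketch a supervisor could be started from the latest offset and 
left with `useEarliestOffset` set to false, without having to flip the config 
afterwards.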
   
   ### Motivation
   
   - Currently, to avoid data loss from new partitions, we must always keep 
`useEarliestOffset` set to true, which forces us to manually change the config 
back and forth whenever we want a different option for first start or reset.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
