aditya-r-m opened a new issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343
 
 
   ### Motivation
   
   For streaming ingestion, Kafka and Kinesis are the two primary options for
Druid as of now.
   This proposal is to write an extension that allows data ingestion from
Google Cloud Pub/Sub.
   
   ### Proposed changes
   
   The proposed extension will work in a manner very similar to the Seekable
Stream Supervisors & Tasks that we currently have, and will be a simplified
version of them in most cases.
   
   Key differences between Pub/Sub & Kafka queues in the context of this
implementation are as follows,
   1. Unlike Kafka, Pub/Sub does not have a concept of ordered logs. Packets are
pulled in batches from the cloud & acknowledgements are sent after successful
processing, per packet (a minimal pull/ack sketch follows this list).
   2. While Kafka has a Topic/Partition hierarchy such that one packet for a
topic goes into only one of its partitions, Pub/Sub has a Topic/Subscription
hierarchy where any packet sent to a topic is replicated to all of its
subscriptions. In the ideal case, each packet should be pulled only once from a
subscription.
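
   To make the pull & per-packet acknowledgement model concrete, here is a
minimal sketch using the standard google-cloud-pubsub Java client with
synchronous pull. The project and subscription names are placeholders, and
this is illustrative only, not part of the proposed extension code.

```java
import com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStubSettings;
import com.google.pubsub.v1.AcknowledgeRequest;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PullRequest;
import com.google.pubsub.v1.PullResponse;
import com.google.pubsub.v1.ReceivedMessage;

import java.util.ArrayList;
import java.util.List;

public class PubsubPullExample
{
  public static void main(String[] args) throws Exception
  {
    // Placeholder identifiers; a real task would take these from the ingestion spec.
    String subscription = ProjectSubscriptionName.format("my-project", "my-subscription");

    SubscriberStubSettings settings = SubscriberStubSettings.newBuilder().build();
    try (SubscriberStub stub = GrpcSubscriberStub.create(settings)) {
      // Pull a batch of up to 100 messages. Unlike Kafka there is no offset to
      // seek to; the subscription simply hands out currently undelivered messages.
      PullRequest pullRequest = PullRequest.newBuilder()
          .setSubscription(subscription)
          .setMaxMessages(100)
          .build();
      PullResponse response = stub.pullCallable().call(pullRequest);

      List<String> ackIds = new ArrayList<>();
      for (ReceivedMessage received : response.getReceivedMessagesList()) {
        // Parse and index received.getMessage().getData() here ...
        ackIds.add(received.getAckId());
      }

      // Acknowledge only after successful processing; anything left unacked
      // past its deadline will be redelivered by Pub/Sub.
      if (!ackIds.isEmpty()) {
        stub.acknowledgeCallable().call(
            AcknowledgeRequest.newBuilder()
                .setSubscription(subscription)
                .addAllAckIds(ackIds)
                .build());
      }
    }
  }
}
```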
   
   One key design decision that we are suggesting is to have a completely 
independent extension that does not share any logic with the Seekable Stream 
Ingestion extensions.
   
   1. The ingestion specs will be very similar to Kafka ingestion specs, with
configuration options that mostly overlap but with some additions & removals
(a hypothetical sketch follows this list).
   2. The structure & patterns in the code will be a simplified merger of
SeekableStreamIndexingService & KafkaIndexingService.
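
   As a rough illustration of point 1, a Pub/Sub supervisor ioConfig could look
like the Kafka one with the offset-related fields removed and a subscription
added. The class and field names below are hypothetical placeholders, not a
committed design.

```java
import org.joda.time.Period;

// Hypothetical sketch only: names loosely mirror KafkaSupervisorIOConfig,
// with the offset/sequence-number options removed.
public class PubsubSupervisorIOConfig
{
  // Replaces Kafka's "topic": tasks pull from a subscription instead.
  private String projectId;
  private String subscription;

  // Familiar supervisor knobs that would carry over largely unchanged.
  private int taskCount;
  private Period taskDuration;
  private int maxMessagesPerPoll;

  // Note: there are no startSequenceNumbers / useEarliestOffset equivalents,
  // since Pub/Sub has no ordered log to seek into.
}
```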
   
   One key challenge to note is that Pub/Sub does not provide exactly-once
semantics like Kafka.
   This means that in cases of high lag or failures, consumers may pull the
same packet more than once from the subscription.
   One reasonable approach to tackle this is best-effort deduplication.
   There are techniques to minimize duplication, a few of which are explained
below:
   
   - Ack deadline tweaks: A consumer can extend ack deadlines when some packets
are taking more time to process than expected (see the sketch after this
list). Starting with a sensible ack deadline configuration also has a
significant impact.
   - Configurable LRU cache / Bloom filter based dedup: A local sketch of the
unique packets encountered in the previous 'x' minutes can prevent duplicate
insertion on the basis of unique packet ids (a minimal sketch is given below,
after the note on perfect deduplication).
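
   For the ack-deadline point, here is a minimal sketch of how a consumer using
the synchronous-pull stub could push out the deadline for messages that are
still being processed. The deadline value and names are placeholders.

```java
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.pubsub.v1.ModifyAckDeadlineRequest;

import java.util.List;

public class AckDeadlineExtension
{
  /**
   * Extends the ack deadline for messages that are still being indexed, so
   * that Pub/Sub does not redeliver them while processing is merely slow.
   */
  static void extendDeadline(SubscriberStub stub, String subscription, List<String> pendingAckIds)
  {
    if (pendingAckIds.isEmpty()) {
      return;
    }
    stub.modifyAckDeadlineCallable().call(
        ModifyAckDeadlineRequest.newBuilder()
            .setSubscription(subscription)
            .addAllAckIds(pendingAckIds)
            .setAckDeadlineSeconds(60) // placeholder; would be configurable
            .build());
  }
}
```

   If the extension ends up using the higher-level streaming Subscriber client
instead, that client can extend deadlines automatically up to a configured
maximum, so this would become a configuration concern rather than explicit
code.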
   
   It could be possible to provide perfect deduplication using a shared
key-value store, but that would be out of scope for the first version of the
extension.
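
   To make the best-effort deduplication idea concrete, below is a minimal
sketch of a bounded LRU set keyed on Pub/Sub message IDs. The entry-count
bound stands in for the "previous 'x' minutes" window; a time-based window or
a Bloom filter would be alternative implementations, and none of this is a
committed design.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Best-effort duplicate detection: remembers the IDs of the most recently seen
 * messages and lets the caller skip any message whose ID is still in the window.
 */
public class RecentMessageIdSet
{
  private final Set<String> recentIds;

  public RecentMessageIdSet(int maxEntries)
  {
    // Access-ordered LinkedHashMap that evicts the eldest entry once the bound is hit.
    this.recentIds = Collections.newSetFromMap(
        new LinkedHashMap<String, Boolean>(16, 0.75f, true)
        {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest)
          {
            return size() > maxEntries;
          }
        });
  }

  /** Returns true if the ID was not seen recently, i.e. the message should be indexed. */
  public boolean markSeen(String messageId)
  {
    return recentIds.add(messageId);
  }
}
```

   Redeliveries of the same Pub/Sub message carry the same server-assigned
message ID, so they are caught as long as the ID is still in the window;
duplicates republished upstream under new IDs are not, hence "best effort".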
   
   ### Rationale
   
   A discussion of why this particular solution is the best one. One good way 
to approach this is to discuss other alternative solutions that you considered 
and decided against. This should also include a discussion of any specific 
benefits or drawbacks you are aware of.
   
   ### Operational impact
   
   Since the extension is a completely new feature with separate code paths, it
does not have an operational impact on running Druid clusters.
   The dependencies that are being updated globally are the Guava & Guice
versions, which are very outdated & do not work with the latest Pub/Sub
libraries. Any possible regression from these upgrades should be caught by
automated tests.
   
   
   ### Test plan
   TODO
   
   ### Future work
   TODO
   
