aditya-r-m opened a new issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343

### Motivation

For streaming ingestion, Kafka and Kinesis queues are the two primary options for Druid as of now. This proposal is to write an extension that will allow data ingestion from Google Cloud Pub/Sub.

### Proposed changes

The proposed extension will work in a manner very similar to the Seekable Stream Supervisors & Tasks that we currently have, and will be a simplified version of the same in most cases. The key differences between Pub/Sub and Kafka queues in the context of this implementation are as follows:

1. Unlike Kafka, Pub/Sub does not have a concept of ordered logs. Packets are pulled in batches from the cloud, and acknowledgements are sent after successful processing, per packet.
2. While Kafka has a Topic/Partition hierarchy such that a packet for a topic goes into only one of its partitions, Pub/Sub has a Topic/Subscription hierarchy where any packet sent to a topic is replicated to all of its subscriptions. In the ideal case, each packet should be pulled only once from a subscription.

One key design decision that we are suggesting is to have a completely independent extension that does not share any logic with the Seekable Stream ingestion extensions.

1. The ingestion specs will be very similar to the Kafka ingestion specs, with configuration options that mostly overlap but with some additions and removals.
2. The structure and patterns in the code will be a simplified merger of SeekableStreamIndexingService and KafkaIndexingService.

One key challenge to note is that Pub/Sub does not provide exactly-once semantics like Kafka. This means that in cases of high lag or failures, consumers may pull the same packet more than once from the subscription. One reasonable approach to tackle this is best-effort deduplication.
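To make the at-least-once behavior concrete: a message that is pulled but not acked before its ack deadline becomes eligible for redelivery, which is exactly how duplicates arise under lag or failure. The following toy model (not the real Pub/Sub client; class and method names are hypothetical) sketches that mechanic with a logical clock:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of at-least-once delivery: a pulled message that is not
// acked before its deadline is put back and delivered again.
public class ToySubscription {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<String, Long> outstanding = new HashMap<>(); // id -> deadline (logical time)
    private final long ackDeadline;
    private long now = 0;

    public ToySubscription(long ackDeadline) {
        this.ackDeadline = ackDeadline;
    }

    public void publish(String id) {
        pending.add(id);
    }

    /** Pulls the next message and starts its ack deadline; returns null if none is available. */
    public String pull() {
        expire();
        String id = pending.poll();
        if (id != null) {
            outstanding.put(id, now + ackDeadline);
        }
        return id;
    }

    /** Acks a message so it will not be redelivered. */
    public void ack(String id) {
        outstanding.remove(id);
    }

    /** Advances the logical clock; messages whose deadline has passed become pullable again. */
    public void tick(long delta) {
        now += delta;
        expire();
    }

    private void expire() {
        outstanding.entrySet().removeIf(e -> {
            if (e.getValue() <= now) {
                pending.add(e.getKey()); // redeliver: the consumer will see a duplicate
                return true;
            }
            return false;
        });
    }
}
```

A consumer that crashes (or simply runs slowly) between `pull()` and `ack()` will receive the same message twice, which is the case the deduplication techniques below try to mitigate.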
There are techniques to minimize duplication, a few of which are explained as follows:

- Ack deadline tweaks: A consumer can reset ack deadlines when some packets are taking more time to process than expected. A sensible ack deadline configuration will also have a large impact to start with.
- Configurable LRU cache / Bloom filter based dedup: A local sketch of the unique packets encountered in the previous 'x' minutes can prevent duplicate insertion on the basis of unique packet ids.

It may be possible to provide perfect deduplication using a shared key-value store, but that would be out of scope for the first version of the extension.

### Rationale

A discussion of why this particular solution is the best one. One good way to approach this is to discuss other alternative solutions that you considered and decided against. This should also include a discussion of any specific benefits or drawbacks you are aware of.

### Operational impact

Since the extension is a completely new feature with separate pathways, it does not have an operational impact on running Druid clusters. The dependencies that are globally being updated are the Guava and Guice versions, which are very outdated and do not work with the latest Pub/Sub libraries. Any possible regression from this should be caught by automated tests.

### Test plan

TODO

### Future work

TODO
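As an illustration of the configurable LRU-cache dedup idea above, here is a minimal standalone sketch (the class name and count-based bound are hypothetical; a real implementation would likely bound by time window and memory as the proposal suggests):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical best-effort dedup: a bounded, access-ordered map of
// recently seen message ids. Oldest entries are evicted, so this only
// catches duplicates that arrive while the id is still in the cache.
public class LruDeduper {
    private final Map<String, Boolean> seen;

    public LruDeduper(final int maxEntries) {
        // accessOrder=true gives LRU semantics; removeEldestEntry caps the size.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns true if this message id has not been seen recently (i.e. should be ingested). */
    public synchronized boolean markSeen(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```

A Bloom filter variant would trade the exactness of this cache for a much smaller memory footprint, at the cost of occasionally dropping a non-duplicate packet; either way the dedup remains best-effort, since ids evicted (or rotated out of the filter) can no longer be detected.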