mgill25 commented on issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343#issuecomment-586257689
 
 
   Hi @jihoonson 
   
   * What semantics are guaranteed by the proposed indexing service? I don't 
think exactly-once ingestion is possible. And how does the proposed 
indexing service guarantee it?
   
   We are proposing a 2 step approach:
   
        1. Build a naive pubsub indexing service that provides the same 
guarantee a regular pubsub consumer would: at-least-once message semantics. 
This is in line with how any normal pubsub consumer works.
   
        2. Do some basic research into how systems such as Dataflow achieve 
exactly-once processing with pubsub. It is clearly possible to achieve this, 
since Dataflow does it with pubsub (although the details of precisely how are 
not yet clear to us). This will be more exploratory work.
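
   One direction worth exploring (purely illustrative, not how Dataflow is 
known to do it): deduplicate redeliveries by message ID before persisting, so 
at-least-once delivery becomes effectively-once processing. The class and 
names below are hypothetical sketches, not Druid or Pub/Sub APIs:

   ```python
   # Hypothetical sketch: effectively-once processing on top of at-least-once
   # delivery, by deduplicating on the message ID before persisting.
   # In a real system the seen-ID set would have to be durable state that
   # survives task restarts; here it is just in-memory for illustration.

   class DedupingIngester:
       def __init__(self):
           self.seen_ids = set()   # would need to be durable in practice
           self.persisted = []

       def ingest(self, message_id, payload):
           """Persist payload unless this message ID was already processed."""
           if message_id in self.seen_ids:
               return False        # redelivery: drop the duplicate
           self.persisted.append(payload)
           self.seen_ids.add(message_id)
           return True
   ```

   A redelivered message is then a no-op: the second `ingest` call with the 
same ID returns `False` and persists nothing.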
   
   * Description on the overall algorithm including what the supervisor and its 
tasks do, respectively.
        - The supervisor looks pretty similar to the KafkaStreamSupervisor in 
its basic functions: creation and management of tasks.
   
        - If more tasks are required to maintain active task count, it submits 
new tasks.
   
        - A single task would do the following basic things:
   
                - Connect to a pubsub subscription.
                - Pull in a batch from pubsub (relevant tuning parameters 
should be available in config).
                - Hand the batch off for persistence.
                - On successful persistence, send an ACK back to 
pubsub for the batch.
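
   The per-task loop above can be sketched as follows. `FakePubsubClient`, 
`run_task`, and `persist_batch` are hypothetical stand-ins for illustration, 
not the real Google Cloud Pub/Sub client API; the key point is that the ACK 
happens only after the persist succeeds:

   ```python
   # Illustrative sketch of one iteration of the task loop: pull a batch,
   # hand it off for persistence, ack only after persistence succeeds.
   from collections import namedtuple

   Message = namedtuple("Message", ["ack_id", "payload"])

   class FakePubsubClient:
       """In-memory stand-in so the loop below is runnable."""
       def __init__(self, messages):
           self.pending = list(messages)
           self.acked = []

       def pull(self, max_messages):
           batch = self.pending[:max_messages]
           self.pending = self.pending[max_messages:]
           return batch

       def ack(self, ack_ids):
           self.acked.extend(ack_ids)

   def run_task(client, persist_batch, max_messages=100):
       """One iteration of the task loop; returns the number of acked messages."""
       batch = client.pull(max_messages=max_messages)  # tunable batch size
       if not batch:
           return 0
       persist_batch([msg.payload for msg in batch])   # may raise on failure
       client.ack([msg.ack_id for msg in batch])       # ack only after persist
       return len(batch)
   ```

   If `persist_batch` raises, no ACK is sent and pubsub will redeliver the 
batch, which is exactly the at-least-once behavior described in step 1.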
   
   * Does the proposed indexing service provide linear scalability? If so, how 
does it provide it?
   
        The service can keep launching new tasks to process data from 
subscriptions, as needed. The supervisor can do periodic checks for pubsub 
metrics, and if the rate of message consumption is falling behind compared to 
the production rate, it can launch new tasks across the cluster.
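
   The supervisor's scaling check could look something like the sketch below. 
The rate inputs and the per-task throughput figure are hypothetical metrics 
for illustration, not actual Pub/Sub monitoring fields:

   ```python
   # Sketch of the supervisor's periodic scaling check: compare the message
   # production rate against the consumption rate, and estimate how many
   # extra tasks are needed to close the gap. All rates are msgs/sec.
   import math

   def extra_tasks_needed(produce_rate, consume_rate, per_task_rate):
       """Return how many additional tasks are needed to keep up (0 if none)."""
       backlog_rate = produce_rate - consume_rate
       if backlog_rate <= 0:
           return 0  # keeping up; no new tasks required
       return math.ceil(backlog_rate / per_task_rate)
   ```

   For example, if producers emit 1000 msgs/sec, tasks currently consume 
600 msgs/sec, and one task handles roughly 150 msgs/sec, the supervisor would 
launch three additional tasks.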
   
   * How does it handle transient failures such as task failures?
        - If a task fails before a successful ACK has been sent out, the 
batch should be reprocessed.
        - If data is successfully persisted but ACK delivery fails, we 
would want to introduce a retry policy.
        - In case of permanent failure, pubsub would redeliver the message, 
which is in line with the at-least-once guarantee of the indexing service.
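
   The retry policy for the "persisted but ACK failed" case could be sketched 
like this. `send_ack` is a hypothetical callable; a real policy would also 
back off between attempts:

   ```python
   # Sketch of a bounded retry policy for ACK delivery. If every attempt
   # fails, we give up and rely on pubsub redelivery (at-least-once); any
   # dedup/idempotence layer would then absorb the duplicate.

   def ack_with_retry(send_ack, ack_ids, max_attempts=3):
       """Return True if the ack eventually succeeded, False otherwise."""
       for _ in range(max_attempts):
           try:
               send_ack(ack_ids)
               return True
           except Exception:
               continue  # transient failure: try again
       return False  # give up; pubsub will redeliver the batch
   ```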
   
   * Exactly-once case: I think it's fair to say we don't currently have a 
very clear understanding of how to make exactly-once work, but we know other 
systems do claim to provide those guarantees. I'm interested in seeing 
whether we can achieve the same with Druid, but for that to happen, the 
foundation described above needs to be built first, IMHO.
   
   There are unanswered questions here that we haven't fleshed out yet. Would 
be happy to brainstorm. :)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 