[ 
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313763#comment-15313763
 ] 

Hemanth Yamijala commented on ATLAS-801:
----------------------------------------

Thanks for sharing your thoughts, Vimal. There are different sorts of 
distributed failure scenarios. If Kafka is indeed down, and no application 
could reach it - then everyone would potentially end up recording messages to a 
distributed store. A linear ordering there with timestamps will help. However, 
there could also be network partitions such that only one of the hook instances 
is unable to reach Kafka. In that case, other instances which could be 
activated later will not know about the one failed hook instance. This could 
result in out of order messages. 

Given the complexity of distributed coordination, I don't think we should 
invest in solving the problem of enforcing an order on the messages. Instead, I 
think we should invest in solving the problem of dealing with out of order 
messages.

> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
>                 Key: ATLAS-801
>                 URL: https://issues.apache.org/jira/browse/ATLAS-801
>             Project: Atlas
>          Issue Type: Improvement
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by 
> the Atlas server. If communication to Kafka breaks, then this results in loss 
> of metadata messages. This can be mitigated to some extent using multiple 
> replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make 
> this even more robust and have some form of store and forward mechanism for 
> increased fault tolerance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to