[ 
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313210#comment-15313210
 ] 

Suma Shivaprasad commented on ATLAS-801:
----------------------------------------

The current hooks may not capture 100% of events for any component, because 
some integration points are missing (e.g. Pig jobs, or any job that creates 
tables directly in HCatalog). We also know that some events will be lost, 
whether for that reason or because of network partitions, problems in 
client-side hooks, etc. Given this, it makes more sense to invest in 
reconciliation than in fault tolerance for when Kafka is down. I agree that 
reconciliation is a larger effort, and that there could be complications in 
handling entity deletions (entities dropped in the source but still present in 
the Atlas repository), which need to be handled along with out-of-order 
events. Another aspect of reconciliation lies within Titan itself, where there 
could be inconsistencies between the storage and indexing backends: 
http://s3.thinkaurelius.com/docs/titan/0.5.0/failure-recovery.html#_transaction_failure
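
To make the reconciliation idea concrete, here is a minimal sketch in Python of 
a diff between the source system and the metadata repository. It is only an 
illustration under stated assumptions, not how Atlas does or would implement 
it: the `EntityState` type, the `last_event_time_ms` field, and the 
`reconcile` function are all hypothetical names, and both sides are assumed to 
be available as simple in-memory maps keyed by qualified name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityState:
    # Hypothetical record; neither field name comes from Atlas.
    qualified_name: str
    last_event_time_ms: int  # time of the latest change seen for the entity


def reconcile(source, repository):
    """Compare source entities with the metadata repository.

    Both arguments map qualified name -> EntityState. Returns three sorted
    lists of names: to create, to update, and to delete in the repository.
    """
    # Present in the source but never recorded (a lost create event).
    to_create = sorted(n for n in source if n not in repository)
    # Dropped in the source but still present in the repository.
    to_delete = sorted(n for n in repository if n not in source)
    # Update only when the source is strictly newer, so a stale or
    # out-of-order view never overwrites newer repository state.
    to_update = sorted(
        n for n in source
        if n in repository
        and source[n].last_event_time_ms > repository[n].last_event_time_ms
    )
    return to_create, to_update, to_delete
```

The strict "newer than" comparison is one possible way to cope with 
out-of-order events; it assumes the event times on both sides are comparable, 
which may not hold across hosts with skewed clocks.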

Hence a +1 for the short-term solution, but I feel we should also invest some 
time in a short-term reconciliation effort as part of another JIRA (though I 
am not sure we have the bandwidth). As part of the work for this JIRA, we 
could also add an event time to each notification message, to make it easy to 
track event times across hooks. (I don't think we capture that currently.)
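
As a rough sketch of the event-time suggestion, a hook could wrap each 
notification in a small envelope before sending it, and a consumer could then 
order messages from different hooks by that time. This is Python for brevity 
and is not the Atlas notification format: the field names `hook`, 
`eventTimeMs`, and `message`, and both function names, are assumptions made 
for illustration.

```python
import json
import time


def wrap_with_event_time(payload, hook_name, now_ms=None):
    """Return a JSON string carrying the payload plus an event time."""
    if now_ms is None:
        # Wall-clock time on the hook's host, in milliseconds since epoch.
        now_ms = int(time.time() * 1000)
    envelope = {
        "hook": hook_name,       # hypothetical field: which hook sent it
        "eventTimeMs": now_ms,   # hypothetical field: when the event occurred
        "message": payload,      # the original notification body
    }
    return json.dumps(envelope, sort_keys=True)


def order_by_event_time(raw_messages):
    """Decode envelopes and sort them by event time (stable for ties)."""
    decoded = [json.loads(m) for m in raw_messages]
    return sorted(decoded, key=lambda e: e["eventTimeMs"])
```

For example, two messages stamped 200 and 100 by different hooks would come 
back from `order_by_event_time` with the one stamped 100 first, regardless of 
the order in which they arrived.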

> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
>                 Key: ATLAS-801
>                 URL: https://issues.apache.org/jira/browse/ATLAS-801
>             Project: Atlas
>          Issue Type: Improvement
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by 
> the Atlas server. If communication to Kafka breaks, then this results in loss 
> of metadata messages. This can be mitigated to some extent using multiple 
> replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make 
> this even more robust and have some form of store and forward mechanism for 
> increased fault tolerance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
