[
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313210#comment-15313210
]
Suma Shivaprasad commented on ATLAS-801:
----------------------------------------
The current hooks may not capture 100% of the events for any component, because
some integration points are missing (e.g. Pig jobs, or any job that creates
tables directly in HCatalog). We already know that we will lose some events for
that reason, or because of network partitions, client-side hooks, etc. Given
that, it makes more sense to invest in reconciliation than in fault tolerance
for when Kafka is down. I agree that reconciliation is a larger effort, and
there could be complications in handling entity deletions (entities dropped in
the source but still present in the Atlas repository), which need to be handled
along with out-of-order events. Another aspect of reconciliation could be within
Titan itself, where there could be inconsistencies between the storage and
indexing backends:
http://s3.thinkaurelius.com/docs/titan/0.5.0/failure-recovery.html#_transaction_failure
Hence a +1 for the short-term solution, but I feel we should invest some time
in a short-term reconciliation effort as part of another JIRA (though I am not
sure we have the bandwidth). As part of the work for this JIRA, we could also
add an event time to each notification message, so that event times can be
tracked easily across hooks. (I don't think we capture that currently.)
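
To illustrate the event-time suggestion, here is a minimal sketch in Python. It
is not Atlas code: the HookNotification class, its field names, and both helper
functions are hypothetical and only show how an event time stamped in the hook
could help with out-of-order events and deletions during reconciliation.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class HookNotification:
    """Hypothetical message envelope; not the actual Atlas notification class."""
    entity_id: str   # assumed identifier, e.g. a qualified table name
    operation: str   # assumed values: "CREATE", "UPDATE" or "DELETE"
    # Event time in milliseconds, stamped by the hook when the message is built.
    event_time_ms: int = field(default_factory=lambda: int(time.time() * 1000))


def latest_by_event_time(
    notifications: Iterable[HookNotification],
    state: Optional[Dict[str, HookNotification]] = None,
) -> Dict[str, HookNotification]:
    """Keep, per entity, the notification with the greatest event time.

    Messages are processed in arrival order. A message that arrives late but
    carries an older event time is ignored; on equal event times the message
    that arrived later wins.
    """
    latest = {} if state is None else state
    for n in notifications:
        current = latest.get(n.entity_id)
        if current is None or n.event_time_ms >= current.event_time_ms:
            latest[n.entity_id] = n
    return latest


def live_entities(latest: Dict[str, HookNotification]) -> Set[str]:
    """Entities whose most recent event (by event time) is not a DELETE."""
    return {eid for eid, n in latest.items() if n.operation != "DELETE"}
```

In this sketch a DELETE is kept as the latest record for its entity, so a
CREATE that arrives afterwards with an older event time does not bring the
entity back.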
> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
> Key: ATLAS-801
> URL: https://issues.apache.org/jira/browse/ATLAS-801
> Project: Atlas
> Issue Type: Improvement
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka, which are then picked
> up by the Atlas server. If communication with Kafka breaks, metadata messages
> are lost. This can be mitigated to some extent by using multiple replicas for
> Kafka topics (see ATLAS-515). This JIRA is to see if we can make this even
> more robust with some form of store-and-forward mechanism for increased
> fault tolerance.