[
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316226#comment-15316226
]
Shwetha G S commented on ATLAS-801:
-----------------------------------
Agree with the short-term solution for now.
1. Ensure multiple replicas for Kafka - the number of consumer threads will
still be 1, right? Otherwise, we might still end up with some message-handling failures
2. Add retries when sending messages to Kafka - don't all hooks already have
configurable retries, with a default of 3? (See the producer sketch after this list.)
3. Build some way of alerting admins of errors in communicating with Kafka - we
probably need to hook into the alerting mechanism of each component?
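For illustration, a minimal sketch of point 2 using the plain Kafka producer API; the broker addresses, topic name, retry count and acks setting here are examples for this sketch, not Atlas's actual configuration keys or defaults:
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: broker list, topic name and retry/ack values are illustrative.
public class HookProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.RETRIES_CONFIG, 3);   // retry transient send failures
        props.put(ProducerConfig.ACKS_CONFIG, "all");  // wait for all in-sync replicas

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        String messageJson = "{\"entities\": []}";     // placeholder hook message
        producer.send(new ProducerRecord<>("ATLAS_HOOK", messageJson),
                (metadata, exception) -> {
                    if (exception != null) {
                        // retries exhausted: alert the admin and/or log the
                        // message to the failed-message file discussed below
                    }
                });
        producer.close();
    }
}
{code}
Replication (point 1) covers broker failures; producer retries only help with transient send errors, which is why a separate failed-message log is still useful when Kafka is down for an extended period.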
Additionally, we can write failed messages to a separate file using log4j and
roll that file based on size, as in the sketch below.
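A sketch of that idea, assuming a dedicated logger on the hook side; the logger name, file path and rolling limits below are made up for the example and could just as well come from log4j.properties:
{code:java}
import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.RollingFileAppender;

// Sketch only: logger name, file path and rolling limits are illustrative.
public class FailedMessageLogger {
    private static final Logger FAILED_LOG = Logger.getLogger("FailedKafkaMessages");

    static {
        try {
            RollingFileAppender appender = new RollingFileAppender(
                    new PatternLayout("%d{ISO8601} %m%n"),
                    "/var/log/atlas/atlas_hook_failed_messages.log",
                    true);                        // append across process restarts
            appender.setMaxFileSize("10MB");      // roll the file based on size
            appender.setMaxBackupIndex(5);        // keep a bounded number of rolled files
            FAILED_LOG.addAppender(appender);
            FAILED_LOG.setAdditivity(false);      // keep these messages out of the main log
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Called from the hook's send-failure path with the serialized message.
    public static void record(String messageJson) {
        FAILED_LOG.error(messageJson);
    }
}
{code}
The point is just that the failed-message file is separate from the hook's normal log and bounded in size.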
> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
> Key: ATLAS-801
> URL: https://issues.apache.org/jira/browse/ATLAS-801
> Project: Atlas
> Issue Type: Improvement
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka, which are then picked up
> by the Atlas server. If communication to Kafka breaks, metadata messages are lost.
> This can be mitigated to some extent using multiple replicas for Kafka topics
> (see ATLAS-515). This JIRA is to see if we can make this even more robust and
> have some form of store-and-forward mechanism for increased fault tolerance.