[
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311825#comment-15311825
]
Hemanth Yamijala commented on ATLAS-801:
----------------------------------------
Starting some analysis notes.
First, I will look at what can be done to minimize the probability of this
happening. This is low-hanging fruit to improve the current situation.
* We need to ensure we configure multiple replicas for ATLAS_HOOK in Kafka.
This is already documented as operational guidance
[here|http://atlas.incubator.apache.org/HighAvailability.html] under the
*Notification Server* section. We could potentially automate this as part of
server setup of Atlas. This was the topic of ATLAS-515.
* We could add some retries to the Kafka producer config. Currently, we use
the default value, which is no retries.
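As a rough illustration of the second point, the sketch below layers retry settings on top of an existing set of producer properties. {{retries}} and {{retry.backoff.ms}} are standard Kafka producer config keys; the class name, the helper method, the broker address and the values 3 and 1000 ms are assumptions chosen for illustration, not what Atlas uses or should necessarily use.

```java
import java.util.Properties;

public class ProducerRetrySketch {

    // Returns a copy of the given producer properties with retries enabled.
    // The retry count and backoff are illustrative values, not tuned settings.
    static Properties withRetries(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("retries", "3");             // Kafka producer key; assumed value
        props.setProperty("retry.backoff.ms", "1000"); // wait between attempts; assumed value
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        // Placeholder broker address, for illustration only.
        base.setProperty("bootstrap.servers", "localhost:9092");

        Properties props = withRetries(base);
        System.out.println("retries=" + props.getProperty("retries"));
        System.out.println("retry.backoff.ms=" + props.getProperty("retry.backoff.ms"));
    }
}
```

One thing to keep in mind: with retries enabled and more than one in-flight request per connection, a retried batch can land after a later one, so message ordering may change unless {{max.in.flight.requests.per.connection}} is set to 1.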
I explored the other Kafka producer configuration options and feel we are OK there.
Specifically:
* *acks* - we use the default value of 1, which is acknowledgement from the
leader alone. This gives us the right balance between reliability and throughput.
* *batch.size* - we use the default value of 16 KB. Empirically, our message
size seems to be about 8 KB, so we probably send about 2 messages per batch.
Again, I don't think there is much to gain by changing this.
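To make the two settings above concrete, here is a small sketch that spells out the defaults explicitly and redoes the batch arithmetic. {{acks}} and {{batch.size}} are standard Kafka producer config keys; the class and method names are hypothetical, and the estimate ignores per-record overhead, so it is only an upper bound on messages per batch.

```java
import java.util.Properties;

public class ProducerDefaultsSketch {

    static final int BATCH_SIZE_BYTES = 16 * 1024; // default batch.size discussed above
    static final int APPROX_MESSAGE_BYTES = 8 * 1024; // empirical estimate from above

    // Writes out the values we currently get implicitly from the defaults.
    static Properties explicitDefaults() {
        Properties props = new Properties();
        props.setProperty("acks", "1"); // acknowledgement from the leader alone
        props.setProperty("batch.size", String.valueOf(BATCH_SIZE_BYTES));
        return props;
    }

    // Upper bound on messages per batch; ignores per-record overhead.
    static int messagesPerBatch(int batchSizeBytes, int messageSizeBytes) {
        return batchSizeBytes / messageSizeBytes;
    }

    public static void main(String[] args) {
        Properties props = explicitDefaults();
        System.out.println("acks=" + props.getProperty("acks"));
        System.out.println("batch.size=" + props.getProperty("batch.size"));
        System.out.println("messages per batch <= "
                + messagesPerBatch(BATCH_SIZE_BYTES, APPROX_MESSAGE_BYTES));
    }
}
```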
> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
> Key: ATLAS-801
> URL: https://issues.apache.org/jira/browse/ATLAS-801
> Project: Atlas
> Issue Type: Improvement
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by
> the Atlas server. If communication to Kafka breaks, then this results in loss
> of metadata messages. This can be mitigated to some extent using multiple
> replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make
> this even more robust and have some form of store and forward mechanism for
> increased fault tolerance.