[ 
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311825#comment-15311825
 ] 

Hemanth Yamijala commented on ATLAS-801:
----------------------------------------

Starting some analysis notes.

First, I will look at what can be done to minimize the probability of this 
happening. This is low-hanging fruit for improving the current situation.

* We need to ensure we configure multiple replicas for ATLAS_HOOK in Kafka. 
This is already documented as operational guidance 
[here|http://atlas.incubator.apache.org/HighAvailability.html] under the 
*Notification Server* section. We could potentially automate this as part of 
Atlas server setup. This was the topic of ATLAS-515.
* We could add some retries to the Kafka producer config. Currently, we use 
the default value, which is no retries.
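
As a rough sketch of the retry idea above, the producer properties could be 
built along these lines. The class and method names are hypothetical and not 
part of Atlas, and the values (3 retries, 1 second backoff) are placeholders 
rather than recommendations. Only the keys *retries* and *retry.backoff.ms* 
are actual Kafka producer settings.

```java
import java.util.Properties;

// Hypothetical helper, not part of Atlas: builds Kafka producer settings
// with retries enabled instead of relying on the default of no retries.
class HookProducerRetrySketch {

    // "retries" and "retry.backoff.ms" are standard Kafka producer config keys.
    static Properties withRetries(int retries, long backoffMs) {
        if (retries < 0 || backoffMs < 0) {
            throw new IllegalArgumentException("retries and backoffMs must be non-negative");
        }
        Properties props = new Properties();
        props.setProperty("retries", Integer.toString(retries));
        props.setProperty("retry.backoff.ms", Long.toString(backoffMs));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values only: retry 3 times, waiting 1 second between attempts.
        Properties props = withRetries(3, 1000L);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Retries like this would only cover short interruptions. An extended Kafka 
outage would still need something like the store and forward mechanism this 
JIRA is about.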

I explored the other Kafka producer configuration settings and feel we are OK there. 
Specifically:

* *acks* - we use the default value of 1, which is acknowledgement from the 
leader alone. This gives us a reasonable balance between reliability and 
throughput.
* *batch.size* - we use the default value of 16 KB. Empirically, our message 
size seems to be about 8 KB, so we probably send about 2 messages per batch. 
Again, I don't think there is much to gain by changing this.
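
The batch estimate above is just integer division of the batch size by the 
average message size. A minimal sketch of that arithmetic follows. The class 
name is hypothetical, the 8 KB figure is the empirical estimate from these 
notes, and per-record overhead is ignored, so the real number could be lower.

```java
// Hypothetical helper, not part of Atlas: back-of-envelope batch estimate.
class BatchSizeEstimate {

    // 16 KB, the batch.size value quoted in these notes as the default.
    static final int DEFAULT_BATCH_SIZE_BYTES = 16 * 1024;

    // Upper bound on whole messages per batch, ignoring per-record overhead.
    static int messagesPerBatch(int batchSizeBytes, int avgMessageBytes) {
        if (batchSizeBytes < 0 || avgMessageBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return batchSizeBytes / avgMessageBytes;
    }

    public static void main(String[] args) {
        int avgMessageBytes = 8 * 1024; // empirical estimate: about 8 KB per message
        System.out.println(messagesPerBatch(DEFAULT_BATCH_SIZE_BYTES, avgMessageBytes));
    }
}
```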

> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
>                 Key: ATLAS-801
>                 URL: https://issues.apache.org/jira/browse/ATLAS-801
>             Project: Atlas
>          Issue Type: Improvement
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by 
> the Atlas server. If communication to Kafka breaks, then this results in loss 
> of metadata messages. This can be mitigated to some extent using multiple 
> replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make 
> this even more robust and have some form of store and forward mechanism for 
> increased fault tolerance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
