[
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311923#comment-15311923
]
Hemanth Yamijala commented on ATLAS-801:
----------------------------------------
The comment above would not help if Kafka is unreachable for an extended period
of time. In this comment, I am exploring some options to address that scenario,
along with pros and cons of the options.
Terminology:
* Host - the service from where metadata is extracted. E.g. HiveServer2, Hive
CLI, Sqoop, Falcon, Storm
* Hook - the Atlas component that is a Kafka producer and runs in the JVM of the
host.
h4. Local Store And Forward
h5. Mechanics
* For every hook, have a configured local storage area on disk in the host
machine which can be used under such exceptional circumstances.
* When a message fails to be sent, start writing to this area (a sketch of these
mechanics follows after this list).
* Until recovery has been effected, all further messages will be written to
this area. In particular, no message should be sent to Kafka from the hook while
any failed message still exists in the local store (otherwise ordering would be
lost).
* Detection of recovery can happen in a couple of ways:
** *Inline detection*: Whenever we get a new message - detect if the first
failed message can be sent to Kafka.
** *Asynchronous detection*: A watcher thread of sorts that tries to keep
sending first message to Kafka periodically.
* Upon detection of recovery, the hook (or watcher) will attempt to send all
failed messages to Kafka. Bulk operations can be employed to maximize
throughput. All sent messages must be marked, so that if a failure happens
midway we can continue from where we left off.
* Once all locally stored messages are forwarded, then writing to Kafka will
resume as usual, until next occurrence.
* Periodic cleaning of sent messages must be done in local store.
* Upon restart of the host component, the hook will first attempt recovery, if
there are failed messages.
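To make the mechanics above concrete, here is a minimal Java sketch of such a hook-side wrapper. Everything in it is hypothetical - the class and interface names, the spool-file layout, the 30-second watcher interval - and {{SendFn}} merely stands in for the hook's existing Kafka producer call; the sketch only illustrates the write-on-failure, asynchronous-detection and mark-after-send behaviour described above.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalSpoolingSender {

    /** Stand-in for the hook's real Kafka producer call; throws on failure. */
    public interface SendFn {
        void send(String message) throws Exception;
    }

    private final SendFn sendFn;
    private final Path spoolDir;
    private long seq = 0;
    private final ScheduledExecutorService watcher =
            Executors.newSingleThreadScheduledExecutor();

    public LocalSpoolingSender(SendFn sendFn, Path spoolDir) throws IOException {
        this.sendFn = sendFn;
        this.spoolDir = Files.createDirectories(spoolDir);
        // Asynchronous detection: periodically try to drain the spool.
        watcher.scheduleWithFixedDelay(this::tryRecover, 30, 30, TimeUnit.SECONDS);
    }

    public synchronized void send(String message) {
        // While any failed message is spooled, never write directly to Kafka
        // (preserves ordering); append to the local store instead.
        if (!spooledInOrder().isEmpty()) {
            spool(message);
            return;
        }
        try {
            sendFn.send(message);
        } catch (Exception e) {
            spool(message);   // first failure: start spooling locally
        }
    }

    private synchronized void tryRecover() {
        try {
            for (Path p : spooledInOrder()) {
                sendFn.send(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                // Mark the message as sent so a crash here does not resend it.
                Files.move(p, p.resolveSibling(p.getFileName() + ".sent"));
            }
        } catch (Exception e) {
            // Kafka still unreachable; keep spooling and retry on the next tick.
        }
    }

    private void spool(String message) {
        try {
            Path f = spoolDir.resolve(
                    String.format("%020d-%010d.msg", System.currentTimeMillis(), seq++));
            Files.write(f, message.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Local storage full or unwritable: the alerting case called out in the cons.
        }
    }

    /** Spooled-but-unsent messages, oldest first (file names sort chronologically). */
    private List<Path> spooledInOrder() {
        try (Stream<Path> files = Files.list(spoolDir)) {
            return files.filter(p -> p.toString().endsWith(".msg"))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }
}
{code}

Renaming (rather than deleting) a spool file after a successful resend is the "marking" mentioned above; the periodic cleanup would then purge the {{*.sent}} files.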
h5. Pros
* This does add the additional layer of resilience we desire, as recovery is
handled entirely in the local environment.
h5. Cons
* The above solution assumes a hook component always runs on the same physical
host. This may not be true if we are running on clients (or even servers in HA
scenarios, etc.). Consider a Hive hook running inside an Oozie action on a
cluster machine - once done, it may never get launched there again, so these
messages would never be recovered.
* Getting a configured local storage area could be a problem for some host
components, particularly client-side components.
* In a worst case scenario, we could end up filling local storage and cause the
host component to fail. This can be mitigated by configuring a threshold. We
would need alerting to notify administrators of such scenarios.
* In a distributed context, say if Kafka is reachable from one hook host but not
from another, we could run into ordering issues - this is similar to the Kafka
retries problem.
h4. Remote store and forward
h5. Mechanics
* Same as above, except the storage area is not local to the host component,
but something distributed (HBase??). A write-path sketch follows below.
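If the remote store were HBase, the hook's write path on failure could look roughly like the following hypothetical sketch. The table name {{atlas_failed_messages}}, the column family {{m}} and the row-key layout are assumptions for illustration only, not a proposal.

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoteSpool {

    /** Store one failed hook message in a shared HBase table. */
    public static void store(Connection connection, String hookId, long seq, String message)
            throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("atlas_failed_messages"))) {
            // Row key = hook id + zero-padded sequence number, so a replay tool can
            // scan each hook's messages back in the order they were produced.
            Put put = new Put(Bytes.toBytes(String.format("%s-%020d", hookId, seq)));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("payload"), Bytes.toBytes(message));
            table.put(put);
        }
    }

    public static void main(String[] args) throws IOException {
        // One shared connection per hook JVM; configuration comes from hbase-site.xml.
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            store(connection, "hive-hook-host1", 1L, "{\"entities\": []}");
        }
    }
}
{code}

Note that this only moves the spool off the host; the hook still has to decide when it is safe to resume writing to Kafka directly.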
h5. Pros
* The distributed store can potentially provide larger volumes of storage space
and can be centrally managed, reducing operational overhead.
* Won’t have the issue of stored data becoming unavailable when the hosts
running the hooks change.
h5. Cons
* Distributed ordering is still a problem.
* If a network partition is the cause of the original problem, writing to the
remote store could be impacted as well.
h4. Only store, don't forward
h5. Mechanics
* Upon send failure, the hook will essentially stop sending data to Kafka and
keep recording to local / remote storage. Specifically, no automatic recovery is
attempted.
* Alerts will be mandatory to detect and correct the problem.
* We could write tools to ingest the stored data back into Kafka once it
recovers (a sketch follows after this list).
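A replay tool for the local-spool variant could be as simple as the hypothetical sketch below. It assumes the spool-file layout from the earlier sketch; the broker address is a placeholder, and {{ATLAS_HOOK}} is the topic the hooks produce to. Sending synchronously ({{send(...).get()}}) keeps the original message order.

{code:java}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SpoolReplayTool {

    public static void main(String[] args) throws Exception {
        Path spoolDir = Paths.get(args[0]);   // the hook's configured spool directory

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9092");   // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            List<Path> pending;
            try (Stream<Path> files = Files.list(spoolDir)) {
                pending = files.filter(p -> p.toString().endsWith(".msg"))
                               .sorted()      // oldest first
                               .collect(Collectors.toList());
            }
            for (Path p : pending) {
                String message = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                // Synchronous send preserves ordering and surfaces failures immediately.
                producer.send(new ProducerRecord<String, String>("ATLAS_HOOK", message)).get();
                Files.move(p, p.resolveSibling(p.getFileName() + ".sent"));
            }
        }
    }
}
{code}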
h5. Pros
* Just avoids work and complexity in the short term.
h5. Cons
* Requires a restart or configuration refresh of the ‘host’ component to let the
hook know it can start writing to Kafka again. Alternatively, requires some
distributed notification for the hooks to know when they can start producing to
Kafka again.
h4. Import missing data
h5. Mechanics
* Do not do anything when writes to Kafka fail. The hooks just stop handling
data.
* Build import functionality to pull metadata from the host components, like we
do for Hive today.
h5. Pros
* No additional complexity in hooks for handling errors.
* Can potentially use the import ability in other scenarios as well - e.g.
recovering from data loss, correcting data inconsistencies, etc.
h5. Cons
* Some components, like Storm, don't store a lot of history, so import after a
reasonably long outage may not be possible. This could lead to loss of data.
* On a similar note, some of the components currently do not store sufficient
information (e.g. lineage in Hive) for such an import to recover all metadata.
Already a long comment - so will stop this here and try to conclude separately.
> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
> Key: ATLAS-801
> URL: https://issues.apache.org/jira/browse/ATLAS-801
> Project: Atlas
> Issue Type: Improvement
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by
> the Atlas server. If communication to Kafka breaks, then this results in loss
> of metadata messages. This can be mitigated to some extent using multiple
> replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make
> this even more robust and have some form of store and forward mechanism for
> increased fault tolerance.