[
https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311923#comment-15311923
]
Hemanth Yamijala commented on ATLAS-801:
----------------------------------------
The comment above would not help if Kafka is unreachable for an extended period
of time. In this comment, I am exploring some options to address that scenario,
along with pros and cons of the options.
Terminology:
* Host - the service from where metadata is extracted. E.g. HiveServer2, Hive
CLI, Sqoop, Falcon, Storm
* Hook - the Atlas component that is a Kafka producer and runs in the JVM of the
host.
h4. Local Store And Forward
h5. Mechanics
* For every hook, have a configured local storage area on disk in the host
machine which can be used under such exceptional circumstances.
* When a message fails to be sent, start writing to this area (a sketch of these
mechanics follows after this list).
* Until recovery has been effected, all further messages will be written to
this area. In particular, no message should be sent to Kafka from the hook while
any failed message still exists in the local store (otherwise ordering would be
lost).
* Detection of recovery can happen in a couple of ways:
** *Inline detection*: Whenever we get a new message - detect if the first
failed message can be sent to Kafka.
** *Asynchronous detection*: A watcher thread of sorts that tries to keep
sending first message to Kafka periodically.
* Upon detection of recovery, the hook (or watcher) will attempt to send all
failed messages to Kafka. Bulk operations can be employed to maximize
throughput. All sent messages must be marked, so that if a failure happens
midway we can continue from where we left off.
* Once all locally stored messages are forwarded, then writing to Kafka will
resume as usual, until next occurrence.
* Periodic cleaning of sent messages must be done in local store.
* Upon restart of the host component, the hook will first attempt recovery, if
there are failed messages.
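To make the mechanics above concrete, here is a minimal Java sketch of such a hook-side wrapper. Everything in it is hypothetical - the class and interface names, the spool-file layout, the 30-second watcher interval - and {{SendFn}} merely stands in for the hook's existing Kafka producer call; the sketch only illustrates the write-on-failure, asynchronous-detection and mark-after-send behaviour described above.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalSpoolingSender {

    /** Stand-in for the hook's real Kafka producer call; throws on failure. */
    public interface SendFn {
        void send(String message) throws Exception;
    }

    private final SendFn sendFn;
    private final Path spoolDir;
    private long seq = 0;
    private final ScheduledExecutorService watcher =
            Executors.newSingleThreadScheduledExecutor();

    public LocalSpoolingSender(SendFn sendFn, Path spoolDir) throws IOException {
        this.sendFn = sendFn;
        this.spoolDir = Files.createDirectories(spoolDir);
        // Asynchronous detection: periodically try to drain the spool.
        watcher.scheduleWithFixedDelay(this::tryRecover, 30, 30, TimeUnit.SECONDS);
    }

    public synchronized void send(String message) {
        // While any failed message is spooled, never write directly to Kafka
        // (preserves ordering); append to the local store instead.
        if (!spooledInOrder().isEmpty()) {
            spool(message);
            return;
        }
        try {
            sendFn.send(message);
        } catch (Exception e) {
            spool(message);   // first failure: start spooling locally
        }
    }

    private synchronized void tryRecover() {
        try {
            for (Path p : spooledInOrder()) {
                sendFn.send(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                // Mark the message as sent so a crash here does not resend it.
                Files.move(p, p.resolveSibling(p.getFileName() + ".sent"));
            }
        } catch (Exception e) {
            // Kafka still unreachable; keep spooling and retry on the next tick.
        }
    }

    private void spool(String message) {
        try {
            Path f = spoolDir.resolve(
                    String.format("%020d-%010d.msg", System.currentTimeMillis(), seq++));
            Files.write(f, message.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Local storage full or unwritable: the alerting case called out in the cons.
        }
    }

    /** Spooled-but-unsent messages, oldest first (file names sort chronologically). */
    private List<Path> spooledInOrder() {
        try (Stream<Path> files = Files.list(spoolDir)) {
            return files.filter(p -> p.toString().endsWith(".msg"))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }
}
{code}

Renaming (rather than deleting) a spool file after a successful resend is the "marking" mentioned above; the periodic cleanup would then purge the {{*.sent}} files.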
h5. Pros
* This does add the additional layer of resilience we desire, as recovery is
handled entirely in the local environment.
h5. Cons
* The above solution assumes a hook component always runs on the same physical
host. This may not be true if we are running on clients (or even servers in HA
scenarios, etc.). Consider a Hive hook running inside an Oozie action on a
cluster machine - once done, it may never get launched there again, so these
messages would never be recovered.
* Getting a configured local storage area could be a problem for some host
components, particularly client-side components.
* In a worst case scenario, we could end up filling local storage and cause the
host component to fail. This can be mitigated by configuring a threshold. We
would need alerting to notify administrators of such scenarios.
* In a distributed context, say if Kafka is reachable from one hook host but not
from another, we could run into ordering issues - this is similar to the Kafka
retries problem.
h4. Remote store and forward
h5. Mechanics
* Same as above, except the storage area is not local to the host component,
but something distributed (HBase??). A write-path sketch follows below.
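If the remote store were HBase, the hook's write path on failure could look roughly like the following hypothetical sketch. The table name {{atlas_failed_messages}}, the column family {{m}} and the row-key layout are assumptions for illustration only, not a proposal.

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoteSpool {

    /** Store one failed hook message in a shared HBase table. */
    public static void store(Connection connection, String hookId, long seq, String message)
            throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("atlas_failed_messages"))) {
            // Row key = hook id + zero-padded sequence number, so a replay tool can
            // scan each hook's messages back in the order they were produced.
            Put put = new Put(Bytes.toBytes(String.format("%s-%020d", hookId, seq)));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("payload"), Bytes.toBytes(message));
            table.put(put);
        }
    }

    public static void main(String[] args) throws IOException {
        // One shared connection per hook JVM; configuration comes from hbase-site.xml.
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            store(connection, "hive-hook-host1", 1L, "{\"entities\": []}");
        }
    }
}
{code}

Note that this only moves the spool off the host; the hook still has to decide when it is safe to resume writing to Kafka directly.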
h5. Pros
* The distributed store can potentially provide larger volumes of storage space
and can be centrally managed, reducing operational overhead.
* Won’t have the issue of stored data becoming unavailable when the hosts
running the hooks change.
h5. Cons
* Distributed ordering is still a problem.
* If a network partition is the cause of the original problem, writing to the
remote store could be impacted as well.
h4. Only store, don't forward
h5. Mechanics
* Upon send failure, the hook will essentially stop sending data to Kafka and
keep recording to local / remote storage. Specifically, no automatic recovery is
attempted.
* Alerts will be mandatory to detect and correct the problem.
* We could write tools to ingest the stored data back into Kafka once it
recovers (a sketch follows after this list).
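A replay tool for the local-spool variant could be as simple as the hypothetical sketch below. It assumes the spool-file layout from the earlier sketch; the broker address is a placeholder, and {{ATLAS_HOOK}} is the topic the hooks produce to. Sending synchronously ({{send(...).get()}}) keeps the original message order.

{code:java}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SpoolReplayTool {

    public static void main(String[] args) throws Exception {
        Path spoolDir = Paths.get(args[0]);   // the hook's configured spool directory

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9092");   // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            List<Path> pending;
            try (Stream<Path> files = Files.list(spoolDir)) {
                pending = files.filter(p -> p.toString().endsWith(".msg"))
                               .sorted()      // oldest first
                               .collect(Collectors.toList());
            }
            for (Path p : pending) {
                String message = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                // Synchronous send preserves ordering and surfaces failures immediately.
                producer.send(new ProducerRecord<String, String>("ATLAS_HOOK", message)).get();
                Files.move(p, p.resolveSibling(p.getFileName() + ".sent"));
            }
        }
    }
}
{code}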
h5. Pros
* Just avoids work and complexity in the short term.
h5. Cons
* Requires a restart or configuration refresh of the ‘host’ component to let the
hook know it can start writing to Kafka again. Alternatively, requires some
distributed notification for the hooks to know when they can start producing to
Kafka again.
h4. Import missing data
h5. Mechanics
* Do not do anything when writes to Kafka fail. The hooks just stop handling
data.
* Build import functionality to pull metadata from the host components, like we
do for Hive today.
h5. Pros
* No additional complexity in hooks for handling errors.
* Can potentially use the import ability in other scenarios as well - e.g.
recovering from data loss, correcting data inconsistencies, etc.
h5. Cons
* Some components, like Storm, don't store a lot of history, so import after a
reasonably long outage may not be possible. This could lead to loss of data.
* On a similar note, some of the components currently do not store sufficient
information (e.g. lineage in Hive) for such an import to recover all metadata.
Already a long comment - so will stop this here and try to conclude separately.
> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
> Key: ATLAS-801
> URL: https://issues.apache.org/jira/browse/ATLAS-801
> Project: Atlas
> Issue Type: Improvement
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by
> the Atlas server. If communication to Kafka breaks, then this results in loss
> of metadata messages. This can be mitigated to some extent using multiple
> replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make
> this even more robust and have some form of store and forward mechanism for
> increased fault tolerance.