[
https://issues.apache.org/jira/browse/OOZIE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913273#comment-13913273
]
Ryota Egashira commented on OOZIE-1492:
---------------------------------------
Hi, I did a gap analysis of several cases regarding HA support for HCat. Please
correct me if anything is missing or wrong.
- Case 1 (straightforward case: no server down, and the coord action is
submitted and started on one server)
Suppose a coord job is submitted to Oozie server X and the coord action is
materialized there.
CoordMaterializeTransitionXCommand registers the missing partitions of the
coord action in the in-memory dependencyCache and also registers the topic
(table name) with JMS.
When a partition becomes available, a notification is sent from JMS to Oozie
X, and once all partitions are available, the coord action becomes ready.
Fine so far.
- Case 2 (server goes down after the coord action is materialized)
Suppose Oozie X goes down after materialization of the coord action.
After a while (currently 10 min by default), the coord action is picked up by
RecoveryService on another Oozie server (say Y), which queues
CoordPushDependencyCheckXCommand. That command polls HCatalog to get the list
of currently missing partitions, registers them in the dependency cache on
Oozie Y, and registers the topic with JMS from Oozie Y (or makes the coord
action ready if all partitions are available). Afterwards, notifications are
sent to Oozie Y.
- Case 3 (server goes down after coord job submission but before
materialization)
The coord job is in PREP status, and RecoveryService needs to pick it up (it
seems that it is not picked up in the current code).
- Case 4 (no server down, but the coord action is picked up by RecoveryService
on another Oozie server)
Suppose the coord job is submitted and the coord action materialized on Oozie
X, but the coord action is then picked up by RecoveryService on another Oozie
server, Y.
This is similar to Case 2: the dependency cache is updated and the JMS topic
is registered from Oozie Y, and things are fine afterwards.
However, Oozie X is left with an outdated dependency cache and is still a
subscriber of the topic, both of which need to be cleaned up.
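
To make the cache behavior in Cases 1, 2 and 4 concrete, here is a minimal
sketch of a per-server in-memory dependency cache. It is not Oozie code:
HCatStub, ServerSketch, and the methods registerMissing, recover,
onPartitionAvailable and evict are hypothetical names made up for
illustration, and the real commands may behave differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for HCatalog: records which partitions exist.
class HCatStub {
    private final Set<String> available = new HashSet<>();

    void addPartition(String partition) { available.add(partition); }

    boolean exists(String partition) { return available.contains(partition); }
}

// Hypothetical model of one Oozie server; not the real Oozie classes.
class ServerSketch {
    final String name;
    // actionId -> partitions still missing; lives only in this server's memory.
    final Map<String, Set<String>> missing = new HashMap<>();

    ServerSketch(String name) { this.name = name; }

    // Case 1: at materialization, cache the action's missing partitions.
    void registerMissing(String actionId, Set<String> partitions) {
        missing.put(actionId, new HashSet<>(partitions));
    }

    // Cases 2 and 4: a recovery pass on this server polls the catalog and
    // caches whatever is still missing. Returns true if the action is ready.
    boolean recover(String actionId, Set<String> partitions, HCatStub hcat) {
        Set<String> stillMissing = new HashSet<>();
        for (String p : partitions) {
            if (!hcat.exists(p)) {
                stillMissing.add(p);
            }
        }
        if (stillMissing.isEmpty()) {
            missing.remove(actionId);
            return true;
        }
        missing.put(actionId, stillMissing);
        return false;
    }

    // Notification handler: returns the actions that became ready.
    Set<String> onPartitionAvailable(String partition) {
        Set<String> ready = new HashSet<>();
        missing.entrySet().removeIf(entry -> {
            entry.getValue().remove(partition);
            if (entry.getValue().isEmpty()) {
                ready.add(entry.getKey());
                return true;
            }
            return false;
        });
        return ready;
    }

    // Case 4 cleanup: drop the stale entry left behind on the original server.
    void evict(String actionId) { missing.remove(actionId); }
}

class DependencyCacheDemo {
    public static void main(String[] args) {
        HCatStub hcat = new HCatStub();
        ServerSketch x = new ServerSketch("X");
        ServerSketch y = new ServerSketch("Y");
        Set<String> deps = new HashSet<>();
        deps.add("dt=1");
        deps.add("dt=2");

        x.registerMissing("action-1", deps);               // Case 1 on X
        hcat.addPartition("dt=1");
        boolean ready = y.recover("action-1", deps, hcat); // Case 4: Y takes over
        System.out.println("ready after recovery on Y: " + ready);
        System.out.println("stale entry on X: " + x.missing.get("action-1"));
        x.evict("action-1");                               // clean up X
        hcat.addPartition("dt=2");
        System.out.println("became ready on Y: " + y.onPartitionAvailable("dt=2"));
    }
}
```

The point of the sketch is that after Y recovers the action, X's map still
holds the old entry until something explicitly evicts it, which is the
cleanup gap described in Case 4.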
Additional code is needed for Cases 3 and 4, but not much.
One disadvantage of this approach (relying on RecoveryService to pick up the
coord action when an Oozie server goes down) is latency.
Also, according to the messaging service team (using JMS) at Y!, there is no
issue with the same topic being registered from different Oozie servers (each
Oozie server simply becomes a subscriber of the topic).
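
The multi-subscriber point can be illustrated with a toy publish/subscribe
model. This is only a sketch of topic fan-out semantics: TopicBrokerSketch is
a hypothetical in-memory class, not the javax.jms API, and the topic name
"default.clicks" is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a topic broker (not the javax.jms API).
class TopicBrokerSketch {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Each caller (for example, each Oozie server) becomes one more subscriber.
    void subscribe(String topic, Consumer<String> listener) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(listener);
    }

    // Delivers a copy of the message to every subscriber of the topic and
    // returns how many subscribers received it.
    int publish(String topic, String message) {
        List<Consumer<String>> listeners =
                subscribers.getOrDefault(topic, new ArrayList<>());
        for (Consumer<String> listener : listeners) {
            listener.accept(message);
        }
        return listeners.size();
    }
}

class TopicFanOutDemo {
    public static void main(String[] args) {
        TopicBrokerSketch broker = new TopicBrokerSketch();
        List<String> receivedByX = new ArrayList<>();
        List<String> receivedByY = new ArrayList<>();

        // Both servers register the same topic (table name).
        broker.subscribe("default.clicks", receivedByX::add);
        broker.subscribe("default.clicks", receivedByY::add);

        int delivered = broker.publish("default.clicks", "partition dt=1 added");
        System.out.println("delivered to " + delivered + " subscribers");
        System.out.println("X got: " + receivedByX + ", Y got: " + receivedByY);
    }
}
```

In this model both servers receive the notification independently, so a stale
subscriber such as X in Case 4 keeps receiving messages it no longer needs
until it unsubscribes.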
> Make sure HA works with HCat and SLA notifications
> --------------------------------------------------
>
> Key: OOZIE-1492
> URL: https://issues.apache.org/jira/browse/OOZIE-1492
> Project: Oozie
> Issue Type: Improvement
> Components: HA
> Affects Versions: trunk
> Reporter: Robert Kanter
>
> We need to make sure HA works with HCat integration and SLA notifications.
> Both have in-memory datastructures and HA will impact them.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)