[ 
https://issues.apache.org/jira/browse/OOZIE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913273#comment-13913273
 ] 

Ryota Egashira commented on OOZIE-1492:
---------------------------------------

Hi, I did a gap analysis of several cases regarding HA support for HCat. Please 
correct me if anything is missing or wrong.

- Case 1 (straightforward case: no server down, and the coord action is 
submitted/started on one server)
Suppose a coord job is submitted to Oozie server X, and the coord action is 
materialized there.
CoordMaterializeTransitionXCommand registers the missing partitions of the coord 
action in the in-memory dependencyCache and also registers the topic (table 
name) with JMS. 
When a partition becomes available, a notification is sent from JMS to Oozie X, 
and once all partitions are available, the coord action becomes ready. 
Fine so far.
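
To make the Case 1 flow concrete, here is a minimal sketch of the bookkeeping 
described above. It is not Oozie's actual dependency cache: the class name, 
method names, and partition strings are all hypothetical, and the JMS layer is 
reduced to a plain method call.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the in-memory dependency cache; not Oozie code.
class DependencyCacheSketch {
    // missing partition -> IDs of coord actions waiting on it
    private final Map<String, Set<String>> waiters = new HashMap<>();
    // coord action ID -> partitions that action is still missing
    private final Map<String, Set<String>> missing = new HashMap<>();

    // Done at materialization time: record what the action is waiting for.
    void register(String actionId, Collection<String> missingPartitions) {
        missing.put(actionId, new HashSet<>(missingPartitions));
        for (String p : missingPartitions) {
            waiters.computeIfAbsent(p, k -> new HashSet<>()).add(actionId);
        }
    }

    // Called when a notification reports that a partition is available.
    // Returns the IDs of actions that have no missing partitions left.
    List<String> onPartitionAvailable(String partition) {
        List<String> ready = new ArrayList<>();
        Set<String> actions = waiters.remove(partition);
        if (actions == null) {
            return ready; // nobody on this server was waiting for it
        }
        for (String actionId : actions) {
            Set<String> stillMissing = missing.get(actionId);
            if (stillMissing == null) {
                continue;
            }
            stillMissing.remove(partition);
            if (stillMissing.isEmpty()) {
                missing.remove(actionId);
                ready.add(actionId);
            }
        }
        return ready;
    }

    boolean isWaiting(String actionId) {
        return missing.containsKey(actionId);
    }
}
```

Because this state lives only in the memory of one server, it is lost when that 
server goes down, which is what the following cases are about.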

- Case 2 (server down after coord action materialized)
Suppose Oozie X goes down after materialization of the coord action.
After a while (currently 10 min by default), the coord action is picked up by 
the RecoveryService on another Oozie server (say Y), which queues 
CoordPushDependencyCheckXCommand. That command polls HCatalog for the list of 
currently missing partitions, registers them in the dependency cache on Oozie Y, 
and registers the topic with JMS from Oozie Y (or makes the coord action ready 
if all partitions are available). Afterwards, notifications are sent to Oozie Y.
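
The recovery path in Case 2 can be sketched roughly as follows. This is only an 
illustration under assumptions: RecoverySketch, recheck and isStale are 
hypothetical names, the HCatalog poll is modelled as a Predicate, and the 
10-minute interval is simply the default mentioned above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of what server Y does for a recovered coord action.
class RecoverySketch {
    // Assumed recovery interval, taken from the "10 min default" above.
    static final long RECOVERY_INTERVAL_MS = 10 * 60 * 1000L;

    // coord action ID -> partitions still missing (server Y's local cache)
    final Map<String, Set<String>> missingByAction = new HashMap<>();
    // topics (table names) this server has subscribed to
    final Set<String> subscribedTopics = new HashSet<>();

    // An action becomes a recovery candidate once it has been idle long enough.
    static boolean isStale(long lastModifiedMs, long nowMs) {
        return nowMs - lastModifiedMs >= RECOVERY_INTERVAL_MS;
    }

    // partitionToTopic: each declared partition and the topic it belongs to.
    // partitionExists: stand-in for polling HCatalog.
    // Returns true if the action can be made ready right away.
    boolean recheck(String actionId,
                    Map<String, String> partitionToTopic,
                    Predicate<String> partitionExists) {
        Set<String> stillMissing = new HashSet<>();
        for (Map.Entry<String, String> e : partitionToTopic.entrySet()) {
            if (!partitionExists.test(e.getKey())) {
                stillMissing.add(e.getKey());
                subscribedTopics.add(e.getValue()); // register topic from Y
            }
        }
        if (stillMissing.isEmpty()) {
            missingByAction.remove(actionId);
            return true;
        }
        missingByAction.put(actionId, stillMissing);
        return false;
    }
}
```

Rebuilding the missing list from a fresh poll, rather than trying to copy X's 
state, is what lets Y take over without any shared in-memory data.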

- Case 3 (server down after coord job submission but before materialization)
The coord job is in PREP status, and the recovery service needs to pick it up 
(it seems that it is not picked up in the current code).

- Case 4 (no server down, but the coord action is picked up by the recovery 
service on another Oozie server)
Suppose the coord job is submitted and the coord action materialized on Oozie X, 
but the coord action is picked up by the RecoveryService of another Oozie 
server, Y.
This is similar to Case 2: the dependency cache is updated and the JMS topic is 
registered from Oozie Y, and everything is fine afterwards.
However, Oozie X now has an outdated dependency cache and is still a subscriber 
of the topic; both need to be cleaned up.
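
One possible shape for the Case 4 cleanup on Oozie X is sketched below. The 
comment above does not say how the cleanup would be done, so this is purely an 
assumption: a per-topic set of waiting actions, where X unsubscribes from a 
topic only once no local action needs it. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical per-server bookkeeping for stale cache entries; not Oozie code.
class StaleEntryCleanupSketch {
    // topic (table name) -> coord action IDs on this server waiting on it
    private final Map<String, Set<String>> actionsByTopic = new HashMap<>();

    void track(String topic, String actionId) {
        actionsByTopic.computeIfAbsent(topic, k -> new HashSet<>()).add(actionId);
    }

    // Drop an action that another server has taken over.
    // Returns true if this server should now unsubscribe from the topic,
    // i.e. the released action was the last local one waiting on it.
    boolean release(String topic, String actionId) {
        Set<String> actions = actionsByTopic.get(topic);
        if (actions == null || !actions.remove(actionId)) {
            return false; // not tracked here, nothing to clean up
        }
        if (actions.isEmpty()) {
            actionsByTopic.remove(topic);
            return true;
        }
        return false;
    }

    boolean isSubscribed(String topic) {
        return actionsByTopic.containsKey(topic);
    }
}
```

How X would learn that Y has taken over the action is left open here.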

Additional code is needed for Cases 3 and 4, but not much.
One disadvantage of this approach (relying on the recovery service to pick up 
the coord action when an Oozie server is down) is latency.  
Also, according to the messaging service team (which uses JMS) at Y!, there is 
no issue with the same topic being registered from different Oozie servers 
(each Oozie server simply becomes a subscriber of the topic).


> Make sure HA works with HCat and SLA notifications
> --------------------------------------------------
>
>                 Key: OOZIE-1492
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1492
>             Project: Oozie
>          Issue Type: Improvement
>          Components: HA
>    Affects Versions: trunk
>            Reporter: Robert Kanter
>
> We need to make sure HA works with HCat integration and SLA notifications. 
> Both have in-memory datastructures and HA will impact them.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
