[ 
https://issues.apache.org/jira/browse/HIVE-18952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-18952:
------------------------------------
    Status: Patch Available  (was: Open)

Updated the patch so that it actually works.
I was able to pass an AM between HS2 instances on the cluster and run queries 
on it while switching between the HS2s. 
One question is whether stop() should actually kill AMs. Practical failover 
situations, such as crashes or notLeader caused by network issues, would still 
be OK, but I wonder whether it is ever desirable for AMs to survive a graceful 
stop, i.e. whether a graceful stop/ctrl-c can be part of a valid HA scenario. 
For now I'm keeping it as is for testing; otherwise, I will make a small change 
to distinguish notLeader from stop.
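
To make the stop/notLeader distinction concrete, here is a minimal sketch of 
what such a change could look like. It is not code from the patch: 
ShutdownReason, SketchSession and SketchSessionPool are hypothetical names 
invented for illustration, and the sketch assumes that a graceful stop should 
kill AMs while losing leadership should leave them running for another HS2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shutdown reasons; not taken from the Hive patch. */
enum ShutdownReason { STOP, NOT_LEADER }

/** Hypothetical stand-in for a Tez session backed by a running AM. */
class SketchSession {
    final String amId;
    boolean amKilled = false;  // AM was asked to shut down
    boolean detached = false;  // HS2 let go of the AM without killing it

    SketchSession(String amId) { this.amId = amId; }
}

/** Hypothetical pool that treats stop and notLeader differently. */
class SketchSessionPool {
    private final List<SketchSession> sessions = new ArrayList<>();

    void add(SketchSession s) { sessions.add(s); }

    /** Returns the ids of AMs left running for another HS2 to pick up. */
    List<String> shutdown(ShutdownReason reason) {
        List<String> surviving = new ArrayList<>();
        for (SketchSession s : sessions) {
            if (reason == ShutdownReason.NOT_LEADER) {
                // Assumption: on leadership loss, only disconnect from the AM.
                s.detached = true;
                surviving.add(s.amId);
            } else {
                // Assumption: on graceful stop, kill the AM.
                s.amKilled = true;
            }
        }
        sessions.clear();
        return surviving;
    }
}
```

If graceful stop turns out to be part of a valid HA scenario, the STOP branch 
would simply take the same path as NOT_LEADER.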

Another thing, which may or may not be related, is that when connecting I see 
some killed tasks and odd logging from two task communicators, one of which 
is asking many containers to die. I'm not sure whether it's normal for both the 
LLAP and Tez communicators to run together.
{noformat}
2018-03-20 21:58:43,774 [WARN] [IPC Server handler 28 on 45657] |app.TezTaskCommunicatorImpl|: Received task heartbeat from unknown container with id: container_222212222_1914_01_000768, asking it to die
2018-03-20 21:58:43,778 [INFO] [TaskCommunicator # 2] |tezplugins.LlapTaskCommunicator|: Successfully launched task: attempt_1520459437616_1914_3_04_000161_0
{noformat}
Queries still succeed though.

I will test more tomorrow to see whether this is related to this patch, the Tez 
patch, or something else.

The patch should be ready for review.
The age calculation improvement will be in a follow-up JIRA.

> Tez session disconnect and reconnect on HS2 HA failover
> -------------------------------------------------------
>
>                 Key: HIVE-18952
>                 URL: https://issues.apache.org/jira/browse/HIVE-18952
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 3.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HIVE-18952.01.patch, HIVE-18952.patch
>
>
> Now that TEZ-3892 is committed, HIVE-18281 can make use of tez session 
> disconnect and reconnect on HA failover.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
