[
https://issues.apache.org/jira/browse/HIVE-22359?focusedWorklogId=389041&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-389041
]
ASF GitHub Bot logged work on HIVE-22359:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Feb/20 20:08
Start Date: 18/Feb/20 20:08
Worklog Time Spent: 10m
Work Description: prasanthj commented on pull request #917: HIVE-22359:
LLAP: when a node restarts with the exact same host/port in kubernetes it is
not detected as a task failure
URL: https://github.com/apache/hive/pull/917
In Kubernetes environments, the hostnames and ports of the LLAP service stay the
same, but the IP addresses of pods can change. Some code in LLAP assumes a stable
hostname:port and caches connections keyed on it. Also, the AM thinks that a
certain host is running some task attempts, but when the LLAP pod restarts, all
the tasks on that node get killed or replaced with new tasks, in which case
LLAP will heartbeat with different task attempts, which the AM does not expect.
This PR fixes 2 issues:
- Includes the IP address in the hostId that is used for caching RPC connections
- When the AM expects some tasks to be on a node and they do not exist there,
it kills those task attempts so that they get rescheduled.
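The two fixes can be illustrated with a minimal Java sketch. This is not the code from the PR: the class name, method names, and key format below are hypothetical and chosen only for illustration. It assumes the connection cache is keyed by a string, and that the AM can compare the set of attempts it expects on a node against the set the node reports in its heartbeat.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: names and key format are assumptions, not Hive's actual API.
public class LlapNodeIdentitySketch {

    // Builds a cache key that changes when the pod IP changes,
    // even if hostname and port stay the same after a restart.
    static String hostId(String hostname, String ipAddress, int port) {
        return hostname + "/" + ipAddress + ":" + port;
    }

    // Returns the attempts the AM expects on a node that the node's
    // heartbeat did not report. These are candidates to be killed
    // so that they can be rescheduled.
    static Set<String> missingAttempts(Set<String> expectedByAm, Set<String> reportedByNode) {
        Set<String> missing = new HashSet<>(expectedByAm);
        missing.removeAll(reportedByNode);
        return missing;
    }

    public static void main(String[] args) {
        // Same hostname and port, different pod IP: the keys differ,
        // so a cached connection to the old pod is not reused.
        String before = hostId("llap-0.example", "10.0.0.5", 15001);
        String after = hostId("llap-0.example", "10.0.0.9", 15001);
        System.out.println("keys equal: " + before.equals(after));

        // The AM expects two attempts, but the restarted node reports an unrelated one.
        Set<String> expected = new HashSet<>(Arrays.asList("attempt_a_0", "attempt_b_0"));
        Set<String> reported = new HashSet<>(Arrays.asList("attempt_c_0"));
        System.out.println("attempts to kill: " + missingAttempts(expected, reported));
    }
}
```

Keying on hostname, IP, and port together means a restarted pod is treated as a new endpoint, and the set difference gives the AM an explicit list of attempts to fail rather than waiting for them to time out.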
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 389041)
Remaining Estimate: 0h
Time Spent: 10m
> LLAP: when a node restarts with the exact same host/port in kubernetes it is
> not detected as a task failure
> -----------------------------------------------------------------------------------------------------------
>
> Key: HIVE-22359
> URL: https://issues.apache.org/jira/browse/HIVE-22359
> Project: Hive
> Issue Type: Bug
> Reporter: Gopal Vijayaraghavan
> Assignee: Prasanth Jayachandran
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22359.1.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code}
> <14>1 2019-10-16T22:16:39.233Z
> query-coordinator-0-5.query-coordinator-0-service.compute-1569601454-l2x9.svc.cluster.local
> query-coordinator 1 461e5ad9-f05f-11e9-85f7-06e84765763e [mdc@18060
> class="tezplugins.LlapTaskCommunicator" level="INFO" thread="IPC Server handler 4 on
> 33333"] The tasks we expected to be on the node are not there:
> attempt_1569601631911_0000_1_04_000034_0,
> attempt_1569601631911_0000_1_04_000071_0,
> attempt_1569601631911_0000_1_04_000191_0,
> attempt_1569601631911_0000_1_04_000211_0,
> attempt_1569601631911_0000_1_04_000229_0,
> attempt_1569601631911_0000_1_04_000231_0,
> attempt_1569601631911_0000_1_04_000235_0,
> attempt_1569601631911_0000_1_04_000242_0,
> attempt_1569601631911_0000_1_04_000160_1,
> attempt_1569601631911_0000_1_04_000012_2,
> attempt_1569601631911_0000_1_04_000003_2,
> attempt_1569601631911_0000_1_04_000056_2,
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)