[ https://issues.apache.org/jira/browse/HIVE-21912?focusedWorklogId=271768&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-271768 ]
ASF GitHub Bot logged work on HIVE-21912:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 03/Jul/19 17:45
Start Date: 03/Jul/19 17:45
Worklog Time Spent: 10m
Work Description: odraese commented on pull request #698: HIVE-21912:
Implement DisablingDaemonStatisticsHandler
URL: https://github.com/apache/hive/pull/698#discussion_r300074321
##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4358,6 +4358,40 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
"The listener which is called when new Llap Daemon statistics is received on AM side.\n" +
"The listener should implement the " +
"org.apache.hadoop.hive.llap.tezplugins.metrics.LlapMetricsListener interface."),
+ LLAP_TASK_SCHEDULER_BLACKLISTING_METRICS_LISTENER_MIN_SERVED_TASKS(
+ "hive.llap.task.scheduler.blacklisting.metrics.listener.min.served.tasks", 2000,
+ "If the number of tasks served by a node is below this number then we will ignore the node\n" +
+ "when calculating the status of the cluster.\n" +
+ "Only used if hive.llap.task.scheduler.am.collect.daemon.metrics.listener is set to\n" +
+ "org.apache.hadoop.hive.llap.tezplugins.metrics.BlacklistingLlapMetricsListener"),
+ LLAP_TASK_SCHEDULER_BLACKLISTING_METRICS_LISTENER_MIN_CHANGE_DELAY(
+ "hive.llap.task.scheduler.blacklisting.metrics.listener.min.change.delay", "300s",
+ new TimeValidator(TimeUnit.SECONDS),
+ "The minimum time which should elapse between blacklisting nodes, in seconds.\n" +
+ "Only used if hive.llap.task.scheduler.am.collect.daemon.metrics.listener is set to\n" +
+ "org.apache.hadoop.hive.llap.tezplugins.metrics.BlacklistingLlapMetricsListener"),
+ LLAP_TASK_SCHEDULER_BLACKLISTING_METRICS_LISTENER_TIME_THRESHOLD(
+ "hive.llap.task.scheduler.blacklisting.metrics.listener.time.threshold", 1.5f,
+ "If the average response time of this node divided by the average response time of all the other nodes\n" +
+ "is greater than this threshold and the other conditions are satisfied too,\n" +
+ "then this node should be blacklisted.\n" +
+ "Only used if hive.llap.task.scheduler.am.collect.daemon.metrics.listener is set to\n" +
+ "org.apache.hadoop.hive.llap.tezplugins.metrics.BlacklistingLlapMetricsListener"),
+ LLAP_TASK_SCHEDULER_BLACKLISTING_METRICS_LISTENER_EMPTY_EXECUTORS(
+ "hive.llap.task.scheduler.blacklisting.metrics.listener.empty.executors.threshold", 2.0f,
Review comment:
A default of 2x the required empty executors doesn't make much sense to me. A
factor of 1x would already be enough, because we could still blacklist without
any negative impact (enough empty executors are available to cover the load).
But even then, I would argue that values <1.0 make sense. If the average task
execution time is 100ms on a healthy node and 200ms on a limping node, then it
is better to send the tasks (even with queueing) to the healthy nodes than to
keep using the limping node. I would actually treat this "empty executors =
free capacity" protection mechanism as purely optional and default it to
zero. @t3rmin4t0r - any thoughts on this?
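
To make the 100ms vs. 200ms arithmetic concrete, here is a toy model. It is an
assumption for illustration, not code from the PR: a task's completion time is
taken to be its queue wait on a single executor plus its own execution time.
The class and method names are hypothetical.

```java
/** Toy model of the queueing trade-off; hypothetical, not part of the Hive code base. */
public class QueueingTradeoff {

  /** Completion time (ms) of a task that waits behind tasksAhead tasks on one executor. */
  public static long completionMs(long avgExecMs, int tasksAhead) {
    return (long) tasksAhead * avgExecMs + avgExecMs;
  }

  /**
   * True if queueing on the healthy node finishes no later than
   * starting immediately on the limping node.
   */
  public static boolean preferHealthy(long healthyExecMs, long limpingExecMs,
      int tasksAheadOnHealthy) {
    return completionMs(healthyExecMs, tasksAheadOnHealthy)
        <= completionMs(limpingExecMs, 0);
  }

  public static void main(String[] args) {
    // Figures from the example: 100ms on a healthy node, 200ms on a limping one.
    for (int ahead = 0; ahead <= 2; ahead++) {
      System.out.printf("tasksAhead=%d healthy=%dms limping=%dms preferHealthy=%b%n",
          ahead, completionMs(100, ahead), completionMs(200, 0),
          preferHealthy(100, 200, ahead));
    }
  }
}
```

Under this simplified model, a task queued behind one full task on a healthy
node breaks even with running immediately on the limping node, so the healthy
nodes need no empty executors at all to be at least as good a choice.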
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 271768)
Time Spent: 1h (was: 50m)
> Implement BlacklistingLlapMetricsListener
> -----------------------------------------
>
> Key: HIVE-21912
> URL: https://issues.apache.org/jira/browse/HIVE-21912
> Project: Hive
> Issue Type: Sub-task
> Components: llap, Tez
> Reporter: Peter Vary
> Assignee: Peter Vary
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-21912.patch, HIVE-21912.wip-2.patch,
> HIVE-21912.wip.patch
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> We should implement a DaemonStatisticsHandler which disables a limping node
> when both of the following hold:
> * The node's average response time is more than 150% (configurable) of the
> average response time of the other nodes
> * The other nodes have enough empty executors to handle the requests
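
The two conditions above could be sketched as a single predicate. This is a
hypothetical illustration, not the actual BlacklistingLlapMetricsListener: the
class name, the method signature, and the reading of "enough empty executors"
as "threshold times the executors of the candidate node" are all assumptions.

```java
/** Hypothetical sketch of the disabling rule; not the implementation from the patch. */
public class LimpingNodeRule {

  /**
   * @param nodeAvgResponseMs       average response time of the candidate node
   * @param othersAvgResponseMs     average response time of all the other nodes
   * @param othersEmptyExecutors    empty executors summed over the other nodes
   * @param nodeExecutors           executors the other nodes would have to cover (assumption)
   * @param timeThreshold           response time ratio limit, e.g. 1.5f for 150%
   * @param emptyExecutorsThreshold required spare capacity factor, e.g. 2.0f
   */
  public static boolean shouldDisable(double nodeAvgResponseMs, double othersAvgResponseMs,
      int othersEmptyExecutors, int nodeExecutors,
      float timeThreshold, float emptyExecutorsThreshold) {
    if (othersAvgResponseMs <= 0) {
      return false; // no baseline to compare against
    }
    // Condition 1: the node is slower than the others by more than the threshold.
    boolean limping = nodeAvgResponseMs / othersAvgResponseMs > timeThreshold;
    // Condition 2: the other nodes have enough free capacity (assumed interpretation).
    boolean enoughCapacity = othersEmptyExecutors >= emptyExecutorsThreshold * nodeExecutors;
    return limping && enoughCapacity;
  }

  public static void main(String[] args) {
    // Node at 200ms vs. 100ms elsewhere, 4 executors to cover, 8 empty elsewhere.
    System.out.println(shouldDisable(200, 100, 8, 4, 1.5f, 2.0f));
    // Same node, but only 4 empty executors elsewhere.
    System.out.println(shouldDisable(200, 100, 4, 4, 1.5f, 2.0f));
  }
}
```

With an empty executors threshold of zero, the second condition is always
satisfied and only the response time ratio decides.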
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)