Rajesh Balamohan created TEZ-1947:
-------------------------------------

             Summary: Failing fast when DAG configs have wrong values can save cluster resources
                 Key: TEZ-1947
                 URL: https://issues.apache.org/jira/browse/TEZ-1947
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan


It would be beneficial to run certain config checks up front rather than 
have the DAG fail later downstream.  For example, the DAG below failed 
after 400+ seconds because of a config issue.

{code}
Status: Running (Executing on YARN cluster with App id application_1421164610335_0060)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ......          KILLED    251        170        0       81       0      81
Reducer 2             FAILED   1009          0        0     1009      23    1008
--------------------------------------------------------------------------------
VERTICES: 00/02  [===>>-----------------------] 13%   ELAPSED TIME: 449.01 s
--------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1421164610335_0060_1_01, 
diagnostics=[Task failed, taskId=task_1421164610335_0060_1_01_000004, 
diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running 
task:java.lang.RuntimeException: Invlaid configuration: maxSingleShuffleLimit 
should be less than mergeThresholdmaxSingleShuffleLimit: 238251152, 
mergeThreshold: 148668720
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.<init>(MergeManager.java:260)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.<init>(Shuffle.java:206)
        at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.start(OrderedGroupedKVInput.java:124)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$StartInputCallable.call(LogicalIOProcessorRuntimeTask.java:405)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$StartInputCallable.call(LogicalIOProcessorRuntimeTask.java:393)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 1 failed, info=[Error: Failure while running 
task:java.lang.RuntimeException: Invlaid configuration: maxSingleShuffleLimit 
should be less than mergeThresholdmaxSingleShuffleLimit: 238251152, 
mergeThreshold: 148668720
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.<init>(MergeManager.java:260)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.<init>(Shuffle.java:206)
        at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.start(OrderedGroupedKVInput.java:124)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$StartInputCallable.call(LogicalIOProcessorRuntimeTask.java:405)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$StartInputCallable.call(LogicalIOProcessorRuntimeTask.java:393)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
{code}
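
A minimal sketch of the kind of up-front check this issue asks for. It 
only mirrors the condition stated in the error message above 
(maxSingleShuffleLimit should be less than mergeThreshold); the class 
ShuffleConfigPrecheck and its validate method are hypothetical names, not 
part of Tez, and how the two values would be derived from the DAG 
configuration on the client side is left out as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Tez class: checks the shuffle limits before
// the DAG is submitted, so a bad config is reported without using the cluster.
class ShuffleConfigPrecheck {

    // Returns a list of problems; an empty list means the values look consistent.
    static List<String> validate(long maxSingleShuffleLimit, long mergeThreshold) {
        List<String> errors = new ArrayList<>();
        if (maxSingleShuffleLimit <= 0) {
            errors.add("maxSingleShuffleLimit must be positive: " + maxSingleShuffleLimit);
        }
        if (mergeThreshold <= 0) {
            errors.add("mergeThreshold must be positive: " + mergeThreshold);
        }
        // Same condition that the task reported at runtime in the log above.
        if (maxSingleShuffleLimit >= mergeThreshold) {
            errors.add("maxSingleShuffleLimit should be less than mergeThreshold."
                    + " maxSingleShuffleLimit: " + maxSingleShuffleLimit
                    + ", mergeThreshold: " + mergeThreshold);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Values taken from the failure reported in this issue.
        List<String> errors = validate(238251152L, 148668720L);
        for (String e : errors) {
            System.err.println("Invalid configuration: " + e);
        }
        if (!errors.isEmpty()) {
            // Fail fast on the client instead of after tasks have started.
            System.exit(1);
        }
    }
}
```

Returning a list rather than throwing on the first problem lets a client 
report every bad value in one pass before refusing to submit the DAG.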



