[
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jaume M updated TEZ-3976:
-------------------------
Summary: ShuffleManager reporting too many errors (was: {{ShuffleManager}}
reporting too many errors)
> ShuffleManager reporting too many errors
> ----------------------------------------
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jaume M
> Priority: Major
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #0 ()]
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5:
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701,
> attemptNumber=0,
> pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2,
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701,
> attemptNumber=0,
> pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2,
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN [Fetcher_B {Reducer_5} #1 ()]
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0,
> pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2,
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #1 ()]
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5:
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589,
> attemptNumber=0,
> pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2,
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589,
> attemptNumber=0,
> pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2,
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to
> OOM after around 30 minutes and around 10 million shuffle input errors (and
> 10 million lines like the previous ones). When the ShufflerManager is closed
> and the counters reported there are many shuffle input errors, some of those
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988 INFO [TezTR-441963_21_34_4_0_4
> (1529044441963_0021_34_04_000000_4)] runtime.LogicalIOProcessorRuntimeTask:
> Final Counters for attempt_1529044441963_0021_34_04_000000_4: Counters: 43
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0,
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965,
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0,
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483,
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0,
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603,
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0,
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516,
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1,
> RECORDS_OUT_OPERATOR_GBY_159=1,
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
> FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26,
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965,
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563,
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947,
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616,
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146,
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1,
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO [TezTR-441963_21_34_3_15_1 ()]
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for
> attempt_1529044441963_0021_34_03_000015_1: Counters: 87 [[File System
> Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0,
> FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929,
> HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0,
> HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter
> SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=1, NUM_FAILED_SHUFFLE_INPUTS=105195,
> INPUT_RECORDS_PROCESSED=397, INPUT_SPLIT_LENGTH_BYTES=21563271,
> OUTPUT_RECORDS=15737, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=1235818,
> OUTPUT_BYTES_WITH_OVERHEAD=1267307, OUTPUT_BYTES_PHYSICAL=357520,
> ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=0,
> ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=31, SHUFFLE_BYTES_DECOMPRESSED=17,
> SHUFFLE_BYTES_TO_MEM=31, SHUFFLE_BYTES_TO_DISK=0,
> SHUFFLE_BYTES_DISK_DIRECT=0, SHUFFLE_PHASE_TIME=50525,
> FIRST_EVENT_RECEIVED=9, LAST_EVENT_RECEIVED=61][HIVE DESERIALIZE_ERRORS=0,
> RECORDS_IN_Map_11=395611, RECORDS_OUT_INTERMEDIATE_Map_11=15737,
> RECORDS_OUT_OPERATOR_FIL_152=395611, RECORDS_OUT_OPERATOR_GBY_157=1,
> RECORDS_OUT_OPERATOR_MAPJOIN_154=15736, RECORDS_OUT_OPERATOR_MAP_0=0,
> RECORDS_OUT_OPERATOR_RS_155=15736, RECORDS_OUT_OPERATOR_RS_158=1,
> RECORDS_OUT_OPERATOR_SEL_153=395611, RECORDS_OUT_OPERATOR_SEL_156=15736,
> RECORDS_OUT_OPERATOR_TS_26=395611][TaskCounter_Map_11_INPUT_Map_13
> FIRST_EVENT_RECEIVED=9, INPUT_RECORDS_PROCESSED=1, LAST_EVENT_RECEIVED=61,
> NUM_FAILED_SHUFFLE_INPUTS=105195, NUM_SHUFFLED_INPUTS=1, SHUFFLE_BYTES=31,
> SHUFFLE_BYTES_DECOMPRESSED=17, SHUFFLE_BYTES_DISK_DIRECT=0,
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=31,
> SHUFFLE_PHASE_TIME=50525][TaskCounter_Map_11_INPUT_supplier
> INPUT_RECORDS_PROCESSED=396,
> INPUT_SPLIT_LENGTH_BYTES=21563271][TaskCounter_Map_11_OUTPUT_Reducer_12
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779474, OUTPUT_BYTES_PHYSICAL=164787,
> OUTPUT_BYTES_WITH_OVERHEAD=779485, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1,
> SPILLED_RECORDS=0][TaskCounter_Map_11_OUTPUT_Reducer_6
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=456344, OUTPUT_BYTES_PHYSICAL=192733,
> OUTPUT_BYTES_WITH_OVERHEAD=487822, OUTPUT_LARGE_RECORDS=0,
> OUTPUT_RECORDS=15736,
> SPILLED_RECORDS=0][org.apache.hadoop.hive.llap.counters.LlapIOCounters
> ALLOCATED_BYTES=9633792, ALLOCATED_USED_BYTES=7976706, CACHE_HIT_BYTES=0,
> CACHE_MISS_BYTES=2344364, CONSUMER_TIME_NS=1136392475,
> DECODE_TIME_NS=140377915, HDFS_TIME_NS=145825282, METADATA_CACHE_MISS=4,
> NUM_DECODED_BATCHES=41, NUM_VECTOR_BATCHES=396, ROWS_EMITTED=395611,
> SELECTED_ROWGROUPS=41, TOTAL_IO_TIME_NS=1234990631]]
> {code}
> I think this is happening because the fetcher is in [this
> loop|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L366]
> which is started
> [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L69]
> and not stopped until the [processor is
> closed|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java#L387]
> (which is called from
> [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L83]
> or
> [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L110]).
> In {{ShuffleManager}} we seem to keep track of
> {{TaskCounter.NUM_FAILED_SHUFFLE_INPUTS}} but nothing is done when the value
> gets too high. Maybe something similar to {{ShuffleScheduler}} should be done
> where the fetchers are only retried a certain amount of times.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)