[
https://issues.apache.org/jira/browse/TEZ-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
dobrovolskiy updated TEZ-4121:
------------------------------
Affects Version/s: 0.7.0
> Tez task infinitely hanging on [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written -
> -------------------------------------------------------------------------------------------
>
> Key: TEZ-4121
> URL: https://issues.apache.org/jira/browse/TEZ-4121
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: dobrovolskiy
> Priority: Major
> Attachments: dag_details.png, grafical_view.png, task_attempts.png,
> vertex_details_reducer_2.png
>
>
> On tez 0.7.0.2.6.4.0-91 we've got an infinetly hang task without PID on the
> datanode host
> This task was successfully executed on the second attempt only after
> container killed.
> The cluster was free of resources for this application.
> Could you suggest how to fix this issue?
> The search of bugs for this strange behaviour in Jira doesn't get something.
> See screenshots from TEZ UI:
> timing of task_attempts.png
> vertex_details_reducer_2.png
> dag_details.png
> grafical_view.png
> Log of hanging task:
> {code:java}
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|:
> <MERGEJOIN>Id =5
> <Children>
> <SEL>Id =6
> <Children>
> <FS>Id =7
> <Children>
> <\Children>
> <Parent>Id = 6 null<\Parent>
> <\FS>
> <\Children>
> <Parent>Id = 5 null<\Parent>
> <\SEL>
> <\Children>
> <Parent><\Parent>
> <\MERGEJOIN>
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG
> method=TezInitializeOperators start=1580775588151 end=1580776138502
> duration=550351 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|:
> Starting Output: out_Reducer 2
> 2020-02-04 03:28:58,504 [WARN] [TezChild]
> |objectinspector.StandardStructObjectInspector|: Trying to access 2 fields
> inside a list of 1 elements: [[06640113161911]]
> 2020-02-04 03:28:58,504 [WARN] [TezChild]
> |objectinspector.StandardStructObjectInspector|: ignoring similar errors.
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Final
> Path: FS
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to
> temp file: FS
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: New Final
> Path: FS
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
> 2020-02-04 03:28:58,533 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]:
> records written - 1
> 2020-02-04 03:28:58,535 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE -
> Number of columns: 176 Chosen compression buffer size: 32768
> 2020-02-04 03:28:58,573 [INFO] [TezChild] |orc.WriterImpl|: ORC writer
> created for path:
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
> with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize:
> 262144
> 2020-02-04 03:28:58,599 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]:
> records written - 10
> 2020-02-04 03:28:58,775 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]:
> records written - 100
> 2020-02-04 03:28:59,396 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]:
> records written - 1000
> {code}
> {code:java}
> Log of succesfully second attempt
> 2020-02-04 07:56:51,488 [INFO] [TezChild] |tez.ReduceRecordProcessor|:
> <MERGEJOIN>Id =0
> <Children>
> <SEL>Id =1
> <Children>
> <FS>Id =2
> <Children>
> <\Children>
> <Parent>Id = 1 null<\Parent>
> <\FS>
> <\Children>
> <Parent>Id = 0 null<\Parent>
> <\SEL>
> <\Children>
> <Parent><\Parent>
> <\MERGEJOIN>
> 2020-02-04 07:56:51,504 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG
> method=TezInitializeOperators start=1580792209194 end=1580792211504
> duration=2310 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2020-02-04 07:56:51,504 [INFO] [TezChild] |tez.ReduceRecordProcessor|:
> Starting Output: out_Reducer 2
> 2020-02-04 07:56:51,506 [WARN] [TezChild]
> |objectinspector.StandardStructObjectInspector|: Trying to access 2 fields
> inside a list of 1 elements: [[06640113161911]]
> 2020-02-04 07:56:51,506 [WARN] [TezChild]
> |objectinspector.StandardStructObjectInspector|: ignoring similar errors.
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Final
> Path: FS
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to
> temp file: FS
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: New Final
> Path: FS
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
> 2020-02-04 07:56:51,542 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 1
> 2020-02-04 07:56:51,552 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE -
> Number of columns: 176 Chosen compression buffer size: 32768
> 2020-02-04 07:56:51,598 [INFO] [TezChild] |orc.WriterImpl|: ORC writer
> created for path:
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
> with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize:
> 262144
> 2020-02-04 07:56:51,611 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 10
> 2020-02-04 07:56:51,653 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 100
> 2020-02-04 07:56:51,853 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 1000
> 2020-02-04 07:56:53,393 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 10000
> 2020-02-04 07:57:04,770 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 100000
> 2020-02-04 07:58:51,014 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 1000000
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG
> method=TezRunProcessor start=1580792208921 end=1580792336286 duration=127365
> from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.CommonMergeJoinOperator|:
> Closing operator MERGEJOIN[0]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.SelectOperator|: Closing
> operator SEL[1]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: Closing
> operator FS[2]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]:
> records written - 1040870
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |exec.FileSinkOperator|:
> RECORDS_OUT_1_sample.rpo_current_info_new:1040870,
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |task.TezTaskRunner|: Closing task,
> taskAttemptId=attempt_1579977836338_6440_2_02_001098_1
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting
> down Shuffle for source: Map_1
> 2020-02-04 07:58:56,517 [INFO] [TezChild]
> |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1:
> numDmeEventsSeen=3337, numDmeEventsSeenWithNoData=3336,
> numObsoletionEventsSeen=0, updateOnClose
> 2020-02-04 07:58:56,518 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting
> down Shuffle for source: Map_3
> 2020-02-04 07:58:56,518 [INFO] [TezChild]
> |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 3:
> numDmeEventsSeen=194, numDmeEventsSeenWithNoData=193,
> numObsoletionEventsSeen=0, updateOnClose
> 2020-02-04 07:58:56,518 [INFO] [TezChild] |output.MROutput|: out_Reducer 2
> closed
> 2020-02-04 07:58:56,518 [INFO] [TezChild]
> |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
> 2020-02-04 07:58:56,522 [INFO] [TezChild] |task.TezTaskRunner|: Task
> completed, taskAttemptId=attempt_1579977836338_6440_2_02_001098_1,
> fatalErrorOccurred=false
> 2020-02-04 07:58:56,528 [INFO] [TezChild]
> |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for
> attempt_1579977836338_6440_2_02_001098_1: Counters: 98 [[File System Counters
> FILE_BYTES_READ=383172873, FILE_BYTES_WRITTEN=386154953,
> HDFS_BYTES_WRITTEN=252446634, HDFS_READ_OPS=3, HDFS_WRITE_OPS=3,
> HDFS_OP_CREATE=2, HDFS_OP_GET_FILE_STATUS=3,
> HDFS_OP_RENAME=1][org.apache.tez.common.counters.TaskCounter
> REDUCE_INPUT_GROUPS=1043992, REDUCE_INPUT_RECORDS=1043992,
> COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=1043992, NUM_SHUFFLED_INPUTS=2,
> NUM_SKIPPED_INPUTS=3529, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=2,
> GC_TIME_MILLIS=686, CPU_MILLISECONDS=156620,
> PHYSICAL_MEMORY_BYTES=2147483648, VIRTUAL_MEMORY_BYTES=5494951936,
> COMMITTED_HEAP_BYTES=2147483648, OUTPUT_RECORDS=0,
> ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064,
> ADDITIONAL_SPILLS_BYTES_READ=383172873, SHUFFLE_BYTES=383172873,
> SHUFFLE_BYTES_DECOMPRESSED=1487594202, SHUFFLE_BYTES_TO_MEM=1468064,
> SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_DISK_DIRECT=0,
> NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=2574,
> MERGE_PHASE_TIME=2649, FIRST_EVENT_RECEIVED=49, LAST_EVENT_RECEIVED=318][HIVE
> CREATED_FILES=1, RECORDS_OUT_1_sample.rpo_current_info_new=1040870][Shuffle
> Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0,
> WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0,
> IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle
> Errors_Reducer_2_INPUT_Map_3 BAD_ID=0, CONNECTION=0, IO_ERROR=0,
> WRONG_LENGTH=0, WRONG_MAP=0,
> WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1
> ADDITIONAL_SPILLS_BYTES_READ=381704809, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
> COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=28, LAST_EVENT_RECEIVED=185,
> MERGED_MAP_OUTPUTS=1, MERGE_PHASE_TIME=2348, NUM_DISK_TO_DISK_MERGES=0,
> NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1,
> NUM_SKIPPED_INPUTS=3336, REDUCE_INPUT_GROUPS=1040125,
> REDUCE_INPUT_RECORDS=1040125, SHUFFLE_BYTES=381704809,
> SHUFFLE_BYTES_DECOMPRESSED=1481826651, SHUFFLE_BYTES_DISK_DIRECT=0,
> SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_TO_MEM=0,
> SHUFFLE_PHASE_TIME=2343,
> SPILLED_RECORDS=1040125][TaskCounter_Reducer_2_INPUT_Map_3
> ADDITIONAL_SPILLS_BYTES_READ=1468064,
> ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, COMBINE_INPUT_RECORDS=0,
> FIRST_EVENT_RECEIVED=21, LAST_EVENT_RECEIVED=133, MERGED_MAP_OUTPUTS=1,
> MERGE_PHASE_TIME=301, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0,
> NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=193,
> REDUCE_INPUT_GROUPS=3867, REDUCE_INPUT_RECORDS=3867, SHUFFLE_BYTES=1468064,
> SHUFFLE_BYTES_DECOMPRESSED=5767551, SHUFFLE_BYTES_DISK_DIRECT=0,
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=1468064,
> SHUFFLE_PHASE_TIME=231,
> SPILLED_RECORDS=3867][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2
> OUTPUT_RECORDS=0]]
> 2020-02-04 07:58:56,630 [INFO] [main] |common.TezUtilsInternal|: Redirecting
> log file based on addend:
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)