[ 
https://issues.apache.org/jira/browse/TEZ-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor resolved TEZ-4121.
-------------------------------
    Resolution: Incomplete

let me resolve this as "Incomplete" due to inactivity
hanging hive on tez task in ReduceSinkOperator is typically an issue in hive, 
where the records flow thorugh the operator pipeline very slowly (which is 
usually because of data skew)

> Tez task infinitely hanging on [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written -
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4121
>                 URL: https://issues.apache.org/jira/browse/TEZ-4121
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: dobrovolskiy
>            Priority: Major
>         Attachments: dag_details.png, grafical_view.png, task_attempts.png, 
> vertex_details_reducer_2.png
>
>
> On tez 0.7.0.2.6.4.0-91 we've got an infinetly hang task without PID on the 
> datanode host
>  This task was successfully executed on the second attempt only after 
> container killed.
>  The cluster was free of resources for this application.
>  Could you suggest how to fix this issue? 
>  The search of bugs for this strange behaviour in Jira doesn't get something.
> See screenshots from TEZ UI:
>  timing of task_attempts.png
>  vertex_details_reducer_2.png
>  dag_details.png
>  grafical_view.png
> Log of hanging task:
> {code:java}
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> <MERGEJOIN>Id =5
>   <Children>
>     <SEL>Id =6
>       <Children>
>         <FS>Id =7
>           <Children>
>           <\Children>
>           <Parent>Id = 6 null<\Parent>
>         <\FS>
>       <\Children>
>       <Parent>Id = 5 null<\Parent>
>     <\SEL>
>   <\Children>
>   <Parent><\Parent>
> <\MERGEJOIN>
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
> method=TezInitializeOperators start=1580775588151 end=1580776138502 
> duration=550351 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> Starting Output: out_Reducer 2
> 2020-02-04 03:28:58,504 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: Trying to access 2 fields 
> inside a list of 1 elements: [[06640113161911]]
> 2020-02-04 03:28:58,504 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: ignoring similar errors.
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to 
> temp file: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: New Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
> 2020-02-04 03:28:58,533 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 1
> 2020-02-04 03:28:58,535 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE - 
> Number of columns: 176 Chosen compression buffer size: 32768
> 2020-02-04 03:28:58,573 [INFO] [TezChild] |orc.WriterImpl|: ORC writer 
> created for path: 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
>  with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
> 262144
> 2020-02-04 03:28:58,599 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 10
> 2020-02-04 03:28:58,775 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 100
> 2020-02-04 03:28:59,396 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 1000
> {code}
> Log of succesfully second attempt
> {code:java}
> 2020-02-04 07:56:51,488 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> <MERGEJOIN>Id =0
>   <Children>
>     <SEL>Id =1
>       <Children>
>         <FS>Id =2
>           <Children>
>           <\Children>
>           <Parent>Id = 1 null<\Parent>
>         <\FS>
>       <\Children>
>       <Parent>Id = 0 null<\Parent>
>     <\SEL>
>   <\Children>
>   <Parent><\Parent>
> <\MERGEJOIN>
> 2020-02-04 07:56:51,504 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
> method=TezInitializeOperators start=1580792209194 end=1580792211504 
> duration=2310 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2020-02-04 07:56:51,504 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> Starting Output: out_Reducer 2
> 2020-02-04 07:56:51,506 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: Trying to access 2 fields 
> inside a list of 1 elements: [[06640113161911]]
> 2020-02-04 07:56:51,506 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: ignoring similar errors.
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to 
> temp file: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: New Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
> 2020-02-04 07:56:51,542 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1
> 2020-02-04 07:56:51,552 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE - 
> Number of columns: 176 Chosen compression buffer size: 32768
> 2020-02-04 07:56:51,598 [INFO] [TezChild] |orc.WriterImpl|: ORC writer 
> created for path: 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
>  with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
> 262144
> 2020-02-04 07:56:51,611 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 10
> 2020-02-04 07:56:51,653 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 100
> 2020-02-04 07:56:51,853 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1000
> 2020-02-04 07:56:53,393 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 10000
> 2020-02-04 07:57:04,770 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 100000
> 2020-02-04 07:58:51,014 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1000000
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
> method=TezRunProcessor start=1580792208921 end=1580792336286 duration=127365 
> from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.CommonMergeJoinOperator|: 
> Closing operator MERGEJOIN[0]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.SelectOperator|: Closing 
> operator SEL[1]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: Closing 
> operator FS[2]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1040870
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |exec.FileSinkOperator|: 
> RECORDS_OUT_1_sample.rpo_current_info_new:1040870, 
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |task.TezTaskRunner|: Closing task, 
> taskAttemptId=attempt_1579977836338_6440_2_02_001098_1
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting 
> down Shuffle for source: Map_1
> 2020-02-04 07:58:56,517 [INFO] [TezChild] 
> |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: 
> numDmeEventsSeen=3337, numDmeEventsSeenWithNoData=3336, 
> numObsoletionEventsSeen=0, updateOnClose
> 2020-02-04 07:58:56,518 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting 
> down Shuffle for source: Map_3
> 2020-02-04 07:58:56,518 [INFO] [TezChild] 
> |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 3: 
> numDmeEventsSeen=194, numDmeEventsSeenWithNoData=193, 
> numObsoletionEventsSeen=0, updateOnClose
> 2020-02-04 07:58:56,518 [INFO] [TezChild] |output.MROutput|: out_Reducer 2 
> closed
> 2020-02-04 07:58:56,518 [INFO] [TezChild] 
> |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
> 2020-02-04 07:58:56,522 [INFO] [TezChild] |task.TezTaskRunner|: Task 
> completed, taskAttemptId=attempt_1579977836338_6440_2_02_001098_1, 
> fatalErrorOccurred=false
> 2020-02-04 07:58:56,528 [INFO] [TezChild] 
> |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for 
> attempt_1579977836338_6440_2_02_001098_1: Counters: 98 [[File System Counters 
> FILE_BYTES_READ=383172873, FILE_BYTES_WRITTEN=386154953, 
> HDFS_BYTES_WRITTEN=252446634, HDFS_READ_OPS=3, HDFS_WRITE_OPS=3, 
> HDFS_OP_CREATE=2, HDFS_OP_GET_FILE_STATUS=3, 
> HDFS_OP_RENAME=1][org.apache.tez.common.counters.TaskCounter 
> REDUCE_INPUT_GROUPS=1043992, REDUCE_INPUT_RECORDS=1043992, 
> COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=1043992, NUM_SHUFFLED_INPUTS=2, 
> NUM_SKIPPED_INPUTS=3529, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=2, 
> GC_TIME_MILLIS=686, CPU_MILLISECONDS=156620, 
> PHYSICAL_MEMORY_BYTES=2147483648, VIRTUAL_MEMORY_BYTES=5494951936, 
> COMMITTED_HEAP_BYTES=2147483648, OUTPUT_RECORDS=0, 
> ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, 
> ADDITIONAL_SPILLS_BYTES_READ=383172873, SHUFFLE_BYTES=383172873, 
> SHUFFLE_BYTES_DECOMPRESSED=1487594202, SHUFFLE_BYTES_TO_MEM=1468064, 
> SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_DISK_DIRECT=0, 
> NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=2574, 
> MERGE_PHASE_TIME=2649, FIRST_EVENT_RECEIVED=49, LAST_EVENT_RECEIVED=318][HIVE 
> CREATED_FILES=1, RECORDS_OUT_1_sample.rpo_current_info_new=1040870][Shuffle 
> Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, 
> WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, 
> IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle 
> Errors_Reducer_2_INPUT_Map_3 BAD_ID=0, CONNECTION=0, IO_ERROR=0, 
> WRONG_LENGTH=0, WRONG_MAP=0, 
> WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 
> ADDITIONAL_SPILLS_BYTES_READ=381704809, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=28, LAST_EVENT_RECEIVED=185, 
> MERGED_MAP_OUTPUTS=1, MERGE_PHASE_TIME=2348, NUM_DISK_TO_DISK_MERGES=0, 
> NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, 
> NUM_SKIPPED_INPUTS=3336, REDUCE_INPUT_GROUPS=1040125, 
> REDUCE_INPUT_RECORDS=1040125, SHUFFLE_BYTES=381704809, 
> SHUFFLE_BYTES_DECOMPRESSED=1481826651, SHUFFLE_BYTES_DISK_DIRECT=0, 
> SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_TO_MEM=0, 
> SHUFFLE_PHASE_TIME=2343, 
> SPILLED_RECORDS=1040125][TaskCounter_Reducer_2_INPUT_Map_3 
> ADDITIONAL_SPILLS_BYTES_READ=1468064, 
> ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, COMBINE_INPUT_RECORDS=0, 
> FIRST_EVENT_RECEIVED=21, LAST_EVENT_RECEIVED=133, MERGED_MAP_OUTPUTS=1, 
> MERGE_PHASE_TIME=301, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, 
> NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=193, 
> REDUCE_INPUT_GROUPS=3867, REDUCE_INPUT_RECORDS=3867, SHUFFLE_BYTES=1468064, 
> SHUFFLE_BYTES_DECOMPRESSED=5767551, SHUFFLE_BYTES_DISK_DIRECT=0, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=1468064, 
> SHUFFLE_PHASE_TIME=231, 
> SPILLED_RECORDS=3867][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 
> OUTPUT_RECORDS=0]]
> 2020-02-04 07:58:56,630 [INFO] [main] |common.TezUtilsInternal|: Redirecting 
> log file based on addend: 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to