[ 
https://issues.apache.org/jira/browse/TEZ-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dobrovolskiy updated TEZ-4121:
------------------------------
    Affects Version/s: 0.7.0

> Tez task infinitely hanging on [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written -
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4121
>                 URL: https://issues.apache.org/jira/browse/TEZ-4121
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: dobrovolskiy
>            Priority: Major
>         Attachments: dag_details.png, grafical_view.png, task_attempts.png, 
> vertex_details_reducer_2.png
>
>
> On tez 0.7.0.2.6.4.0-91 we've got an infinetly hang task without PID on the 
> datanode host
>  This task was successfully executed on the second attempt only after 
> container killed.
> The cluster was free of resources for this application.
> Could you suggest how to fix this issue? 
> The search of bugs for this strange behaviour in Jira doesn't get something.
> See screenshots from TEZ UI:
> timing of task_attempts.png
> vertex_details_reducer_2.png
> dag_details.png
> grafical_view.png
> Log of hanging task:
> {code:java}
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> <MERGEJOIN>Id =5
>   <Children>
>     <SEL>Id =6
>       <Children>
>         <FS>Id =7
>           <Children>
>           <\Children>
>           <Parent>Id = 6 null<\Parent>
>         <\FS>
>       <\Children>
>       <Parent>Id = 5 null<\Parent>
>     <\SEL>
>   <\Children>
>   <Parent><\Parent>
> <\MERGEJOIN>
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
> method=TezInitializeOperators start=1580775588151 end=1580776138502 
> duration=550351 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> Starting Output: out_Reducer 2
> 2020-02-04 03:28:58,504 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: Trying to access 2 fields 
> inside a list of 1 elements: [[06640113161911]]
> 2020-02-04 03:28:58,504 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: ignoring similar errors.
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to 
> temp file: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
> 2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: New Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
> 2020-02-04 03:28:58,533 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 1
> 2020-02-04 03:28:58,535 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE - 
> Number of columns: 176 Chosen compression buffer size: 32768
> 2020-02-04 03:28:58,573 [INFO] [TezChild] |orc.WriterImpl|: ORC writer 
> created for path: 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
>  with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
> 262144
> 2020-02-04 03:28:58,599 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 10
> 2020-02-04 03:28:58,775 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 100
> 2020-02-04 03:28:59,396 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
> records written - 1000
> {code}
> {code:java}
> Log of succesfully second attempt
> 2020-02-04 07:56:51,488 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> <MERGEJOIN>Id =0
>   <Children>
>     <SEL>Id =1
>       <Children>
>         <FS>Id =2
>           <Children>
>           <\Children>
>           <Parent>Id = 1 null<\Parent>
>         <\FS>
>       <\Children>
>       <Parent>Id = 0 null<\Parent>
>     <\SEL>
>   <\Children>
>   <Parent><\Parent>
> <\MERGEJOIN>
> 2020-02-04 07:56:51,504 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
> method=TezInitializeOperators start=1580792209194 end=1580792211504 
> duration=2310 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2020-02-04 07:56:51,504 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
> Starting Output: out_Reducer 2
> 2020-02-04 07:56:51,506 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: Trying to access 2 fields 
> inside a list of 1 elements: [[06640113161911]]
> 2020-02-04 07:56:51,506 [WARN] [TezChild] 
> |objectinspector.StandardStructObjectInspector|: ignoring similar errors.
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to 
> temp file: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
> 2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: New Final 
> Path: FS 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
> 2020-02-04 07:56:51,542 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1
> 2020-02-04 07:56:51,552 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE - 
> Number of columns: 176 Chosen compression buffer size: 32768
> 2020-02-04 07:56:51,598 [INFO] [TezChild] |orc.WriterImpl|: ORC writer 
> created for path: 
> hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
>  with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
> 262144
> 2020-02-04 07:56:51,611 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 10
> 2020-02-04 07:56:51,653 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 100
> 2020-02-04 07:56:51,853 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1000
> 2020-02-04 07:56:53,393 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 10000
> 2020-02-04 07:57:04,770 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 100000
> 2020-02-04 07:58:51,014 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1000000
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
> method=TezRunProcessor start=1580792208921 end=1580792336286 duration=127365 
> from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.CommonMergeJoinOperator|: 
> Closing operator MERGEJOIN[0]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.SelectOperator|: Closing 
> operator SEL[1]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: Closing 
> operator FS[2]
> 2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
> records written - 1040870
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |exec.FileSinkOperator|: 
> RECORDS_OUT_1_sample.rpo_current_info_new:1040870, 
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |task.TezTaskRunner|: Closing task, 
> taskAttemptId=attempt_1579977836338_6440_2_02_001098_1
> 2020-02-04 07:58:56,517 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting 
> down Shuffle for source: Map_1
> 2020-02-04 07:58:56,517 [INFO] [TezChild] 
> |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: 
> numDmeEventsSeen=3337, numDmeEventsSeenWithNoData=3336, 
> numObsoletionEventsSeen=0, updateOnClose
> 2020-02-04 07:58:56,518 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting 
> down Shuffle for source: Map_3
> 2020-02-04 07:58:56,518 [INFO] [TezChild] 
> |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 3: 
> numDmeEventsSeen=194, numDmeEventsSeenWithNoData=193, 
> numObsoletionEventsSeen=0, updateOnClose
> 2020-02-04 07:58:56,518 [INFO] [TezChild] |output.MROutput|: out_Reducer 2 
> closed
> 2020-02-04 07:58:56,518 [INFO] [TezChild] 
> |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
> 2020-02-04 07:58:56,522 [INFO] [TezChild] |task.TezTaskRunner|: Task 
> completed, taskAttemptId=attempt_1579977836338_6440_2_02_001098_1, 
> fatalErrorOccurred=false
> 2020-02-04 07:58:56,528 [INFO] [TezChild] 
> |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for 
> attempt_1579977836338_6440_2_02_001098_1: Counters: 98 [[File System Counters 
> FILE_BYTES_READ=383172873, FILE_BYTES_WRITTEN=386154953, 
> HDFS_BYTES_WRITTEN=252446634, HDFS_READ_OPS=3, HDFS_WRITE_OPS=3, 
> HDFS_OP_CREATE=2, HDFS_OP_GET_FILE_STATUS=3, 
> HDFS_OP_RENAME=1][org.apache.tez.common.counters.TaskCounter 
> REDUCE_INPUT_GROUPS=1043992, REDUCE_INPUT_RECORDS=1043992, 
> COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=1043992, NUM_SHUFFLED_INPUTS=2, 
> NUM_SKIPPED_INPUTS=3529, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=2, 
> GC_TIME_MILLIS=686, CPU_MILLISECONDS=156620, 
> PHYSICAL_MEMORY_BYTES=2147483648, VIRTUAL_MEMORY_BYTES=5494951936, 
> COMMITTED_HEAP_BYTES=2147483648, OUTPUT_RECORDS=0, 
> ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, 
> ADDITIONAL_SPILLS_BYTES_READ=383172873, SHUFFLE_BYTES=383172873, 
> SHUFFLE_BYTES_DECOMPRESSED=1487594202, SHUFFLE_BYTES_TO_MEM=1468064, 
> SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_DISK_DIRECT=0, 
> NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=2574, 
> MERGE_PHASE_TIME=2649, FIRST_EVENT_RECEIVED=49, LAST_EVENT_RECEIVED=318][HIVE 
> CREATED_FILES=1, RECORDS_OUT_1_sample.rpo_current_info_new=1040870][Shuffle 
> Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, 
> WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, 
> IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle 
> Errors_Reducer_2_INPUT_Map_3 BAD_ID=0, CONNECTION=0, IO_ERROR=0, 
> WRONG_LENGTH=0, WRONG_MAP=0, 
> WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 
> ADDITIONAL_SPILLS_BYTES_READ=381704809, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=28, LAST_EVENT_RECEIVED=185, 
> MERGED_MAP_OUTPUTS=1, MERGE_PHASE_TIME=2348, NUM_DISK_TO_DISK_MERGES=0, 
> NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, 
> NUM_SKIPPED_INPUTS=3336, REDUCE_INPUT_GROUPS=1040125, 
> REDUCE_INPUT_RECORDS=1040125, SHUFFLE_BYTES=381704809, 
> SHUFFLE_BYTES_DECOMPRESSED=1481826651, SHUFFLE_BYTES_DISK_DIRECT=0, 
> SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_TO_MEM=0, 
> SHUFFLE_PHASE_TIME=2343, 
> SPILLED_RECORDS=1040125][TaskCounter_Reducer_2_INPUT_Map_3 
> ADDITIONAL_SPILLS_BYTES_READ=1468064, 
> ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, COMBINE_INPUT_RECORDS=0, 
> FIRST_EVENT_RECEIVED=21, LAST_EVENT_RECEIVED=133, MERGED_MAP_OUTPUTS=1, 
> MERGE_PHASE_TIME=301, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, 
> NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=193, 
> REDUCE_INPUT_GROUPS=3867, REDUCE_INPUT_RECORDS=3867, SHUFFLE_BYTES=1468064, 
> SHUFFLE_BYTES_DECOMPRESSED=5767551, SHUFFLE_BYTES_DISK_DIRECT=0, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=1468064, 
> SHUFFLE_PHASE_TIME=231, 
> SPILLED_RECORDS=3867][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 
> OUTPUT_RECORDS=0]]
> 2020-02-04 07:58:56,630 [INFO] [main] |common.TezUtilsInternal|: Redirecting 
> log file based on addend: 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to