dobrovolskiy created TEZ-4121:
---------------------------------

             Summary: Tez task infinitely hanging on [TezChild] 
|exec.FileSinkOperator|: FS[2]: records written -
                 Key: TEZ-4121
                 URL: https://issues.apache.org/jira/browse/TEZ-4121
             Project: Apache Tez
          Issue Type: Bug
            Reporter: dobrovolskiy
         Attachments: dag_details.png, grafical_view.png, task_attempts.png, 
vertex_details_reducer_2.png

On tez 0.7.0.2.6.4.0-91 we've got an infinetly hang task without PID on the 
datanode host
 This task was successfully executed on the second attempt only after container 
killed.
The cluster was free of resources for this application.
Could you suggest how to fix this issue? 
The search of bugs for this strange behaviour in Jira doesn't get something.

See screenshots from TEZ UI:
timing of task_attempts.png
vertex_details_reducer_2.png
dag_details.png
grafical_view.png



Log of hanging task:
{code:java}
2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
<MERGEJOIN>Id =5
  <Children>
    <SEL>Id =6
      <Children>
        <FS>Id =7
          <Children>
          <\Children>
          <Parent>Id = 6 null<\Parent>
        <\FS>
      <\Children>
      <Parent>Id = 5 null<\Parent>
    <\SEL>
  <\Children>
  <Parent><\Parent>
<\MERGEJOIN>
2020-02-04 03:28:58,502 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
method=TezInitializeOperators start=1580775588151 end=1580776138502 
duration=550351 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2020-02-04 03:28:58,502 [INFO] [TezChild] |tez.ReduceRecordProcessor|: Starting 
Output: out_Reducer 2
2020-02-04 03:28:58,504 [WARN] [TezChild] 
|objectinspector.StandardStructObjectInspector|: Trying to access 2 fields 
inside a list of 1 elements: [[06640113161911]]
2020-02-04 03:28:58,504 [WARN] [TezChild] 
|objectinspector.StandardStructObjectInspector|: ignoring similar errors.
2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: 
FS 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to 
temp file: FS 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
2020-02-04 03:28:58,514 [INFO] [TezChild] |exec.FileSinkOperator|: New Final 
Path: FS 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_0
2020-02-04 03:28:58,533 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
records written - 1
2020-02-04 03:28:58,535 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE - Number 
of columns: 176 Chosen compression buffer size: 32768
2020-02-04 03:28:58,573 [INFO] [TezChild] |orc.WriterImpl|: ORC writer created 
for path: 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_0
 with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
262144
2020-02-04 03:28:58,599 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
records written - 10
2020-02-04 03:28:58,775 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
records written - 100
2020-02-04 03:28:59,396 [INFO] [TezChild] |exec.FileSinkOperator|: FS[7]: 
records written - 1000
{code}
{code:java}
Log of succesfully second attempt
2020-02-04 07:56:51,488 [INFO] [TezChild] |tez.ReduceRecordProcessor|: 
<MERGEJOIN>Id =0
  <Children>
    <SEL>Id =1
      <Children>
        <FS>Id =2
          <Children>
          <\Children>
          <Parent>Id = 1 null<\Parent>
        <\FS>
      <\Children>
      <Parent>Id = 0 null<\Parent>
    <\SEL>
  <\Children>
  <Parent><\Parent>
<\MERGEJOIN>
2020-02-04 07:56:51,504 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
method=TezInitializeOperators start=1580792209194 end=1580792211504 
duration=2310 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2020-02-04 07:56:51,504 [INFO] [TezChild] |tez.ReduceRecordProcessor|: Starting 
Output: out_Reducer 2
2020-02-04 07:56:51,506 [WARN] [TezChild] 
|objectinspector.StandardStructObjectInspector|: Trying to access 2 fields 
inside a list of 1 elements: [[06640113161911]]
2020-02-04 07:56:51,506 [WARN] [TezChild] 
|objectinspector.StandardStructObjectInspector|: ignoring similar errors.
2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: 
FS 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to 
temp file: FS 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
2020-02-04 07:56:51,511 [INFO] [TezChild] |exec.FileSinkOperator|: New Final 
Path: FS 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_tmp.-ext-10001/001098_1
2020-02-04 07:56:51,542 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 1
2020-02-04 07:56:51,552 [INFO] [TezChild] |orc.WriterImpl|: WIDE TABLE - Number 
of columns: 176 Chosen compression buffer size: 32768
2020-02-04 07:56:51,598 [INFO] [TezChild] |orc.WriterImpl|: ORC writer created 
for path: 
hdfs://cluster/apps/hive/warehouse/sample.db/.hive-staging_hive_2020-02-04_03-10-36_142_570908369342623638-1/_task_tmp.-ext-10001/_tmp.001098_1
 with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
262144
2020-02-04 07:56:51,611 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 10
2020-02-04 07:56:51,653 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 100
2020-02-04 07:56:51,853 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 1000
2020-02-04 07:56:53,393 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 10000
2020-02-04 07:57:04,770 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 100000
2020-02-04 07:58:51,014 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 1000000
2020-02-04 07:58:56,286 [INFO] [TezChild] |log.PerfLogger|: </PERFLOG 
method=TezRunProcessor start=1580792208921 end=1580792336286 duration=127365 
from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.CommonMergeJoinOperator|: 
Closing operator MERGEJOIN[0]
2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.SelectOperator|: Closing 
operator SEL[1]
2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: Closing 
operator FS[2]
2020-02-04 07:58:56,286 [INFO] [TezChild] |exec.FileSinkOperator|: FS[2]: 
records written - 1040870
2020-02-04 07:58:56,517 [INFO] [TezChild] |exec.FileSinkOperator|: 
RECORDS_OUT_1_sample.rpo_current_info_new:1040870, 
2020-02-04 07:58:56,517 [INFO] [TezChild] |task.TezTaskRunner|: Closing task, 
taskAttemptId=attempt_1579977836338_6440_2_02_001098_1
2020-02-04 07:58:56,517 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting 
down Shuffle for source: Map_1
2020-02-04 07:58:56,517 [INFO] [TezChild] 
|orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: 
numDmeEventsSeen=3337, numDmeEventsSeenWithNoData=3336, 
numObsoletionEventsSeen=0, updateOnClose
2020-02-04 07:58:56,518 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting 
down Shuffle for source: Map_3
2020-02-04 07:58:56,518 [INFO] [TezChild] 
|orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 3: 
numDmeEventsSeen=194, numDmeEventsSeenWithNoData=193, 
numObsoletionEventsSeen=0, updateOnClose
2020-02-04 07:58:56,518 [INFO] [TezChild] |output.MROutput|: out_Reducer 2 
closed
2020-02-04 07:58:56,518 [INFO] [TezChild] 
|runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
2020-02-04 07:58:56,522 [INFO] [TezChild] |task.TezTaskRunner|: Task completed, 
taskAttemptId=attempt_1579977836338_6440_2_02_001098_1, fatalErrorOccurred=false
2020-02-04 07:58:56,528 [INFO] [TezChild] 
|runtime.LogicalIOProcessorRuntimeTask|: Final Counters for 
attempt_1579977836338_6440_2_02_001098_1: Counters: 98 [[File System Counters 
FILE_BYTES_READ=383172873, FILE_BYTES_WRITTEN=386154953, 
HDFS_BYTES_WRITTEN=252446634, HDFS_READ_OPS=3, HDFS_WRITE_OPS=3, 
HDFS_OP_CREATE=2, HDFS_OP_GET_FILE_STATUS=3, 
HDFS_OP_RENAME=1][org.apache.tez.common.counters.TaskCounter 
REDUCE_INPUT_GROUPS=1043992, REDUCE_INPUT_RECORDS=1043992, 
COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=1043992, NUM_SHUFFLED_INPUTS=2, 
NUM_SKIPPED_INPUTS=3529, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=2, 
GC_TIME_MILLIS=686, CPU_MILLISECONDS=156620, PHYSICAL_MEMORY_BYTES=2147483648, 
VIRTUAL_MEMORY_BYTES=5494951936, COMMITTED_HEAP_BYTES=2147483648, 
OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, 
ADDITIONAL_SPILLS_BYTES_READ=383172873, SHUFFLE_BYTES=383172873, 
SHUFFLE_BYTES_DECOMPRESSED=1487594202, SHUFFLE_BYTES_TO_MEM=1468064, 
SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_DISK_DIRECT=0, 
NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=2574, 
MERGE_PHASE_TIME=2649, FIRST_EVENT_RECEIVED=49, LAST_EVENT_RECEIVED=318][HIVE 
CREATED_FILES=1, RECORDS_OUT_1_sample.rpo_current_info_new=1040870][Shuffle 
Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, 
WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, 
IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle 
Errors_Reducer_2_INPUT_Map_3 BAD_ID=0, CONNECTION=0, IO_ERROR=0, 
WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 
ADDITIONAL_SPILLS_BYTES_READ=381704809, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=28, LAST_EVENT_RECEIVED=185, 
MERGED_MAP_OUTPUTS=1, MERGE_PHASE_TIME=2348, NUM_DISK_TO_DISK_MERGES=0, 
NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, 
NUM_SKIPPED_INPUTS=3336, REDUCE_INPUT_GROUPS=1040125, 
REDUCE_INPUT_RECORDS=1040125, SHUFFLE_BYTES=381704809, 
SHUFFLE_BYTES_DECOMPRESSED=1481826651, SHUFFLE_BYTES_DISK_DIRECT=0, 
SHUFFLE_BYTES_TO_DISK=381704809, SHUFFLE_BYTES_TO_MEM=0, 
SHUFFLE_PHASE_TIME=2343, 
SPILLED_RECORDS=1040125][TaskCounter_Reducer_2_INPUT_Map_3 
ADDITIONAL_SPILLS_BYTES_READ=1468064, ADDITIONAL_SPILLS_BYTES_WRITTEN=1468064, 
COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=21, LAST_EVENT_RECEIVED=133, 
MERGED_MAP_OUTPUTS=1, MERGE_PHASE_TIME=301, NUM_DISK_TO_DISK_MERGES=0, 
NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, 
NUM_SKIPPED_INPUTS=193, REDUCE_INPUT_GROUPS=3867, REDUCE_INPUT_RECORDS=3867, 
SHUFFLE_BYTES=1468064, SHUFFLE_BYTES_DECOMPRESSED=5767551, 
SHUFFLE_BYTES_DISK_DIRECT=0, SHUFFLE_BYTES_TO_DISK=0, 
SHUFFLE_BYTES_TO_MEM=1468064, SHUFFLE_PHASE_TIME=231, 
SPILLED_RECORDS=3867][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 
OUTPUT_RECORDS=0]]
2020-02-04 07:58:56,630 [INFO] [main] |common.TezUtilsInternal|: Redirecting 
log file based on addend: 
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to