[ 
https://issues.apache.org/jira/browse/HIVE-24111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190804#comment-17190804
 ] 

László Bodor edited comment on HIVE-24111 at 9/5/20, 7:29 AM:
--------------------------------------------------------------

For reference:
logs for a good run:  
[^org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt] 
logs for a hanging run 1:  [^TestCrudCompactorOnTez.log] 
logs for a hanging run 2:  [^TestCrudCompactorOnTez2.log] 

what is strange for the first sight, I cannot see MergeManager related log 
messages when it's expected, so this could be a shuffle issue

good run:
{code}
2020-09-03T15:13:19,604  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None, 
ifileReadAhead: true
2020-09-03T15:13:19,605  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847, 
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10, 
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-03T15:13:19,605  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map 
1 with configuration: maxFetchFailuresBeforeReporting=5, 
reportReadErrorImmediately=true, maxFailedUniqueFetches=1, 
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1, 
hostFailureFraction=0.2, minFailurePerHost=4, 
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5, 
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-03T15:13:19,606  INFO [I/O Setup 0 Start: {Map 1}] 
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-03T15:13:19,606  INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: 
AutoStartComplete
2020-09-03T15:13:19,606  INFO [ShuffleAndMergeRunner {Map_1}] 
orderedgrouped.MergeManager: Setting merger's parent thread to 
ShuffleAndMergeRunner {Map_1}
2020-09-03T15:13:19,606  INFO [TezChild] task.TaskRunner2Callable: Running 
task, taskAttemptId=attempt_1599171197926_0001_1_01_000000_0
2020-09-03T15:13:19,607  INFO 
[TezTaskEventRouter{attempt_1599171197926_0001_1_01_000000_0}] 
orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: Map 1: 
numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0
2020-09-03T15:13:19,607  INFO [TezChild] exec.SerializationUtilities: 
Deserializing ReduceWork using kryo
2020-09-03T15:13:19,607  INFO [TezChild] exec.Utilities: Deserialized plan (via 
RPC) - name: Reducer 2 size: 1.87KB
2020-09-03T15:13:19,607  INFO [TezChild] tez.ObjectCache: Caching key: 
lbodor_20200903151317_7f539b53-07fb-4bb1-97db-c37d72aba99d_Reducer 
2__REDUCE_PLAN__
2020-09-03T15:13:19,607  INFO [TezChild] tez.RecordProcessor: conf class path = 
[]
2020-09-03T15:13:19,608  INFO [TezChild] tez.RecordProcessor: thread class path 
= []
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] 
orderedgrouped.MergeManager: close onDiskFile. State: NumOnDiskFiles=1. 
Current: 
path=/Users/lbodor/apache/hive/itests/hive-unit/target/tmp/scratchdir/lbodor/_tez_session_dir/e01fa9d5-36d9-4449-bfa4-d12b5fa290f8/.tez/application_1599171197926_0001_wd/localmode-local-dir/output/attempt_1599171197926_0001_1_00_000000_0_10098/file.out,
 len=26
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] ShuffleScheduler.fetch: 
Completed fetch for attempt: {0, 0, 
attempt_1599171197926_0001_1_00_000000_0_10098} to DISK_DIRECT, csize=26, 
dsize=22, EndTime=1599171199608, TimeTaken=1, Rate=0.02 MB/s
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] 
orderedgrouped.ShuffleScheduler: All inputs fetched for input vertex : Map 1
{code}

hanging run:
{code}
2020-09-04T02:12:16,392  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None, 
ifileReadAhead: true
2020-09-04T02:12:16,392  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847, 
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10, 
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-04T02:12:16,394  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map 
1 with configuration: maxFetchFailuresBeforeReporting=5, 
reportReadErrorImmediately=true, maxFailedUniqueFetches=1, 
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1, 
hostFailureFraction=0.2, minFailurePerHost=4, 
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5, 
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-04T02:12:16,398  INFO [I/O Setup 0 Start: {Map 1}] 
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-04T02:12:16,398  INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: 
AutoStartComplete
2020-09-04T02:12:16,398  INFO [TezChild] task.TaskRunner2Callable: Running 
task, taskAttemptId=attempt_1599210735282_0001_1_01_000000_0
2020-09-04T02:12:16,416  INFO [TezChild] tez.ReduceRecordProcessor: Waiting for 
ShuffleInputs to become ready
{code}

so in the hanging run we reach the point of ShuffleScheduler instantiation, but 
than hanging here according to jstack
{code}
"ShuffleAndMergeRunner {Map_1}" #919 daemon prio=5 os_prio=0 
tid=0x00007ff864012800 nid=0x1959e9 in Object.wait() [0x00007ff835ee1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.waitAndNotifyProgress(ShuffleScheduler.java:1459)
        - locked <0x00000000fa885a20> (a 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.access$700(ShuffleScheduler.java:85)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1374)
        - locked <0x00000000fa885a20> (a 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1364)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.start(ShuffleScheduler.java:444)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:293)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:287)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}


was (Author: abstractdog):
For reference:
logs for a good run:  
[^org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log] 
logs for a hanging run:  [^TestCrudCompactorTez.log] 

what is strange for the first sight, I cannot see MergeManager related log 
messages when it's expected, so this could be a shuffle issue

good run:
{code}
2020-09-03T15:13:19,604  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None, 
ifileReadAhead: true
2020-09-03T15:13:19,605  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847, 
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10, 
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-03T15:13:19,605  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map 
1 with configuration: maxFetchFailuresBeforeReporting=5, 
reportReadErrorImmediately=true, maxFailedUniqueFetches=1, 
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1, 
hostFailureFraction=0.2, minFailurePerHost=4, 
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5, 
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-03T15:13:19,606  INFO [I/O Setup 0 Start: {Map 1}] 
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-03T15:13:19,606  INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: 
AutoStartComplete
2020-09-03T15:13:19,606  INFO [ShuffleAndMergeRunner {Map_1}] 
orderedgrouped.MergeManager: Setting merger's parent thread to 
ShuffleAndMergeRunner {Map_1}
2020-09-03T15:13:19,606  INFO [TezChild] task.TaskRunner2Callable: Running 
task, taskAttemptId=attempt_1599171197926_0001_1_01_000000_0
2020-09-03T15:13:19,607  INFO 
[TezTaskEventRouter{attempt_1599171197926_0001_1_01_000000_0}] 
orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: Map 1: 
numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0
2020-09-03T15:13:19,607  INFO [TezChild] exec.SerializationUtilities: 
Deserializing ReduceWork using kryo
2020-09-03T15:13:19,607  INFO [TezChild] exec.Utilities: Deserialized plan (via 
RPC) - name: Reducer 2 size: 1.87KB
2020-09-03T15:13:19,607  INFO [TezChild] tez.ObjectCache: Caching key: 
lbodor_20200903151317_7f539b53-07fb-4bb1-97db-c37d72aba99d_Reducer 
2__REDUCE_PLAN__
2020-09-03T15:13:19,607  INFO [TezChild] tez.RecordProcessor: conf class path = 
[]
2020-09-03T15:13:19,608  INFO [TezChild] tez.RecordProcessor: thread class path 
= []
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] 
orderedgrouped.MergeManager: close onDiskFile. State: NumOnDiskFiles=1. 
Current: 
path=/Users/lbodor/apache/hive/itests/hive-unit/target/tmp/scratchdir/lbodor/_tez_session_dir/e01fa9d5-36d9-4449-bfa4-d12b5fa290f8/.tez/application_1599171197926_0001_wd/localmode-local-dir/output/attempt_1599171197926_0001_1_00_000000_0_10098/file.out,
 len=26
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] ShuffleScheduler.fetch: 
Completed fetch for attempt: {0, 0, 
attempt_1599171197926_0001_1_00_000000_0_10098} to DISK_DIRECT, csize=26, 
dsize=22, EndTime=1599171199608, TimeTaken=1, Rate=0.02 MB/s
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] 
orderedgrouped.ShuffleScheduler: All inputs fetched for input vertex : Map 1
{code}

hanging run:
{code}
2020-09-04T02:12:16,392  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None, 
ifileReadAhead: true
2020-09-04T02:12:16,392  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847, 
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10, 
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-04T02:12:16,394  INFO [I/O Setup 0 Start: {Map 1}] 
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map 
1 with configuration: maxFetchFailuresBeforeReporting=5, 
reportReadErrorImmediately=true, maxFailedUniqueFetches=1, 
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1, 
hostFailureFraction=0.2, minFailurePerHost=4, 
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5, 
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-04T02:12:16,398  INFO [I/O Setup 0 Start: {Map 1}] 
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-04T02:12:16,398  INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: 
AutoStartComplete
2020-09-04T02:12:16,398  INFO [TezChild] task.TaskRunner2Callable: Running 
task, taskAttemptId=attempt_1599210735282_0001_1_01_000000_0
2020-09-04T02:12:16,416  INFO [TezChild] tez.ReduceRecordProcessor: Waiting for 
ShuffleInputs to become ready
{code}

so in the hanging run we reach the point of ShuffleScheduler instantiation, but 
than hanging here according to jstack
{code}
"ShuffleAndMergeRunner {Map_1}" #919 daemon prio=5 os_prio=0 
tid=0x00007ff864012800 nid=0x1959e9 in Object.wait() [0x00007ff835ee1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.waitAndNotifyProgress(ShuffleScheduler.java:1459)
        - locked <0x00000000fa885a20> (a 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.access$700(ShuffleScheduler.java:85)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1374)
        - locked <0x00000000fa885a20> (a 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1364)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.start(ShuffleScheduler.java:444)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:293)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:287)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

> TestMmCompactorOnTez/TestCrudCompactorOnTez hangs when running against Tez 
> 0.10.0 staging artifact
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24111
>                 URL: https://issues.apache.org/jira/browse/HIVE-24111
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TestCrudCompactorOnTez.log, TestCrudCompactorOnTez2.log, 
> jstack.log, 
> org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt
>
>
> Reproduced issue in ptest run which I made to run against tez staging 
> artifacts 
> (https://repository.apache.org/content/repositories/orgapachetez-1068/)
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we 
> won't confirm if it's a hive or tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to