[
https://issues.apache.org/jira/browse/HIVE-24111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190804#comment-17190804
]
László Bodor edited comment on HIVE-24111 at 9/5/20, 7:29 AM:
--------------------------------------------------------------
For reference:
logs for a good run:
[^org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt]
logs for a hanging run 1: [^TestCrudCompactorOnTez.log]
logs for a hanging run 2: [^TestCrudCompactorOnTez2.log]
what is strange for the first sight, I cannot see MergeManager related log
messages when it's expected, so this could be a shuffle issue
good run:
{code}
2020-09-03T15:13:19,604 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None,
ifileReadAhead: true
2020-09-03T15:13:19,605 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847,
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10,
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-03T15:13:19,605 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map
1 with configuration: maxFetchFailuresBeforeReporting=5,
reportReadErrorImmediately=true, maxFailedUniqueFetches=1,
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1,
hostFailureFraction=0.2, minFailurePerHost=4,
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5,
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-03T15:13:19,606 INFO [I/O Setup 0 Start: {Map 1}]
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-03T15:13:19,606 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask:
AutoStartComplete
2020-09-03T15:13:19,606 INFO [ShuffleAndMergeRunner {Map_1}]
orderedgrouped.MergeManager: Setting merger's parent thread to
ShuffleAndMergeRunner {Map_1}
2020-09-03T15:13:19,606 INFO [TezChild] task.TaskRunner2Callable: Running
task, taskAttemptId=attempt_1599171197926_0001_1_01_000000_0
2020-09-03T15:13:19,607 INFO
[TezTaskEventRouter{attempt_1599171197926_0001_1_01_000000_0}]
orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: Map 1:
numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0
2020-09-03T15:13:19,607 INFO [TezChild] exec.SerializationUtilities:
Deserializing ReduceWork using kryo
2020-09-03T15:13:19,607 INFO [TezChild] exec.Utilities: Deserialized plan (via
RPC) - name: Reducer 2 size: 1.87KB
2020-09-03T15:13:19,607 INFO [TezChild] tez.ObjectCache: Caching key:
lbodor_20200903151317_7f539b53-07fb-4bb1-97db-c37d72aba99d_Reducer
2__REDUCE_PLAN__
2020-09-03T15:13:19,607 INFO [TezChild] tez.RecordProcessor: conf class path =
[]
2020-09-03T15:13:19,608 INFO [TezChild] tez.RecordProcessor: thread class path
= []
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0]
orderedgrouped.MergeManager: close onDiskFile. State: NumOnDiskFiles=1.
Current:
path=/Users/lbodor/apache/hive/itests/hive-unit/target/tmp/scratchdir/lbodor/_tez_session_dir/e01fa9d5-36d9-4449-bfa4-d12b5fa290f8/.tez/application_1599171197926_0001_wd/localmode-local-dir/output/attempt_1599171197926_0001_1_00_000000_0_10098/file.out,
len=26
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0] ShuffleScheduler.fetch:
Completed fetch for attempt: {0, 0,
attempt_1599171197926_0001_1_00_000000_0_10098} to DISK_DIRECT, csize=26,
dsize=22, EndTime=1599171199608, TimeTaken=1, Rate=0.02 MB/s
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0]
orderedgrouped.ShuffleScheduler: All inputs fetched for input vertex : Map 1
{code}
hanging run:
{code}
2020-09-04T02:12:16,392 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None,
ifileReadAhead: true
2020-09-04T02:12:16,392 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847,
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10,
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-04T02:12:16,394 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map
1 with configuration: maxFetchFailuresBeforeReporting=5,
reportReadErrorImmediately=true, maxFailedUniqueFetches=1,
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1,
hostFailureFraction=0.2, minFailurePerHost=4,
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5,
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-04T02:12:16,398 INFO [I/O Setup 0 Start: {Map 1}]
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-04T02:12:16,398 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask:
AutoStartComplete
2020-09-04T02:12:16,398 INFO [TezChild] task.TaskRunner2Callable: Running
task, taskAttemptId=attempt_1599210735282_0001_1_01_000000_0
2020-09-04T02:12:16,416 INFO [TezChild] tez.ReduceRecordProcessor: Waiting for
ShuffleInputs to become ready
{code}
so in the hanging run we reach the point of ShuffleScheduler instantiation, but
than hanging here according to jstack
{code}
"ShuffleAndMergeRunner {Map_1}" #919 daemon prio=5 os_prio=0
tid=0x00007ff864012800 nid=0x1959e9 in Object.wait() [0x00007ff835ee1000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.waitAndNotifyProgress(ShuffleScheduler.java:1459)
- locked <0x00000000fa885a20> (a
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.access$700(ShuffleScheduler.java:85)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1374)
- locked <0x00000000fa885a20> (a
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1364)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.start(ShuffleScheduler.java:444)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:293)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:287)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
was (Author: abstractdog):
For reference:
logs for a good run:
[^org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log]
logs for a hanging run: [^TestCrudCompactorTez.log]
what is strange for the first sight, I cannot see MergeManager related log
messages when it's expected, so this could be a shuffle issue
good run:
{code}
2020-09-03T15:13:19,604 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None,
ifileReadAhead: true
2020-09-03T15:13:19,605 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847,
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10,
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-03T15:13:19,605 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map
1 with configuration: maxFetchFailuresBeforeReporting=5,
reportReadErrorImmediately=true, maxFailedUniqueFetches=1,
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1,
hostFailureFraction=0.2, minFailurePerHost=4,
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5,
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-03T15:13:19,606 INFO [I/O Setup 0 Start: {Map 1}]
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-03T15:13:19,606 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask:
AutoStartComplete
2020-09-03T15:13:19,606 INFO [ShuffleAndMergeRunner {Map_1}]
orderedgrouped.MergeManager: Setting merger's parent thread to
ShuffleAndMergeRunner {Map_1}
2020-09-03T15:13:19,606 INFO [TezChild] task.TaskRunner2Callable: Running
task, taskAttemptId=attempt_1599171197926_0001_1_01_000000_0
2020-09-03T15:13:19,607 INFO
[TezTaskEventRouter{attempt_1599171197926_0001_1_01_000000_0}]
orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: Map 1:
numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0
2020-09-03T15:13:19,607 INFO [TezChild] exec.SerializationUtilities:
Deserializing ReduceWork using kryo
2020-09-03T15:13:19,607 INFO [TezChild] exec.Utilities: Deserialized plan (via
RPC) - name: Reducer 2 size: 1.87KB
2020-09-03T15:13:19,607 INFO [TezChild] tez.ObjectCache: Caching key:
lbodor_20200903151317_7f539b53-07fb-4bb1-97db-c37d72aba99d_Reducer
2__REDUCE_PLAN__
2020-09-03T15:13:19,607 INFO [TezChild] tez.RecordProcessor: conf class path =
[]
2020-09-03T15:13:19,608 INFO [TezChild] tez.RecordProcessor: thread class path
= []
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0]
orderedgrouped.MergeManager: close onDiskFile. State: NumOnDiskFiles=1.
Current:
path=/Users/lbodor/apache/hive/itests/hive-unit/target/tmp/scratchdir/lbodor/_tez_session_dir/e01fa9d5-36d9-4449-bfa4-d12b5fa290f8/.tez/application_1599171197926_0001_wd/localmode-local-dir/output/attempt_1599171197926_0001_1_00_000000_0_10098/file.out,
len=26
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0] ShuffleScheduler.fetch:
Completed fetch for attempt: {0, 0,
attempt_1599171197926_0001_1_00_000000_0_10098} to DISK_DIRECT, csize=26,
dsize=22, EndTime=1599171199608, TimeTaken=1, Rate=0.02 MB/s
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0]
orderedgrouped.ShuffleScheduler: All inputs fetched for input vertex : Map 1
{code}
hanging run:
{code}
2020-09-04T02:12:16,392 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None,
ifileReadAhead: true
2020-09-04T02:12:16,392 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847,
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10,
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-04T02:12:16,394 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map
1 with configuration: maxFetchFailuresBeforeReporting=5,
reportReadErrorImmediately=true, maxFailedUniqueFetches=1,
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1,
hostFailureFraction=0.2, minFailurePerHost=4,
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5,
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-04T02:12:16,398 INFO [I/O Setup 0 Start: {Map 1}]
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-04T02:12:16,398 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask:
AutoStartComplete
2020-09-04T02:12:16,398 INFO [TezChild] task.TaskRunner2Callable: Running
task, taskAttemptId=attempt_1599210735282_0001_1_01_000000_0
2020-09-04T02:12:16,416 INFO [TezChild] tez.ReduceRecordProcessor: Waiting for
ShuffleInputs to become ready
{code}
so in the hanging run we reach the point of ShuffleScheduler instantiation, but
than hanging here according to jstack
{code}
"ShuffleAndMergeRunner {Map_1}" #919 daemon prio=5 os_prio=0
tid=0x00007ff864012800 nid=0x1959e9 in Object.wait() [0x00007ff835ee1000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.waitAndNotifyProgress(ShuffleScheduler.java:1459)
- locked <0x00000000fa885a20> (a
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.access$700(ShuffleScheduler.java:85)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1374)
- locked <0x00000000fa885a20> (a
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$ShuffleSchedulerCallable.callInternal(ShuffleScheduler.java:1364)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.start(ShuffleScheduler.java:444)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:293)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:287)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
> TestMmCompactorOnTez/TestCrudCompactorOnTez hangs when running against Tez
> 0.10.0 staging artifact
> --------------------------------------------------------------------------------------------------
>
> Key: HIVE-24111
> URL: https://issues.apache.org/jira/browse/HIVE-24111
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: TestCrudCompactorOnTez.log, TestCrudCompactorOnTez2.log,
> jstack.log,
> org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt
>
>
> Reproduced issue in ptest run which I made to run against tez staging
> artifacts
> (https://repository.apache.org/content/repositories/orgapachetez-1068/)
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we
> won't confirm if it's a hive or tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97
--
This message was sent by Atlassian Jira
(v8.3.4#803005)