[
https://issues.apache.org/jira/browse/HIVE-24111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190804#comment-17190804
]
László Bodor edited comment on HIVE-24111 at 9/4/20, 4:13 PM:
--------------------------------------------------------------
For reference:
logs for a good run:
[^org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log]
logs for a hanging run: [^TestCrudCompactorTez.log]
What is strange at first sight: I cannot see the MergeManager-related log
messages where I would expect them, so this could be a shuffle issue.
good run:
{code}
2020-09-03T15:13:19,604 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None,
ifileReadAhead: true
2020-09-03T15:13:19,605 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847,
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10,
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-03T15:13:19,605 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map
1 with configuration: maxFetchFailuresBeforeReporting=5,
reportReadErrorImmediately=true, maxFailedUniqueFetches=1,
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1,
hostFailureFraction=0.2, minFailurePerHost=4,
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5,
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-03T15:13:19,606 INFO [I/O Setup 0 Start: {Map 1}]
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-03T15:13:19,606 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask:
AutoStartComplete
2020-09-03T15:13:19,606 INFO [ShuffleAndMergeRunner {Map_1}]
orderedgrouped.MergeManager: Setting merger's parent thread to
ShuffleAndMergeRunner {Map_1}
2020-09-03T15:13:19,606 INFO [TezChild] task.TaskRunner2Callable: Running
task, taskAttemptId=attempt_1599171197926_0001_1_01_000000_0
2020-09-03T15:13:19,607 INFO
[TezTaskEventRouter{attempt_1599171197926_0001_1_01_000000_0}]
orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: Map 1:
numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0
2020-09-03T15:13:19,607 INFO [TezChild] exec.SerializationUtilities:
Deserializing ReduceWork using kryo
2020-09-03T15:13:19,607 INFO [TezChild] exec.Utilities: Deserialized plan (via
RPC) - name: Reducer 2 size: 1.87KB
2020-09-03T15:13:19,607 INFO [TezChild] tez.ObjectCache: Caching key:
lbodor_20200903151317_7f539b53-07fb-4bb1-97db-c37d72aba99d_Reducer
2__REDUCE_PLAN__
2020-09-03T15:13:19,607 INFO [TezChild] tez.RecordProcessor: conf class path =
[]
2020-09-03T15:13:19,608 INFO [TezChild] tez.RecordProcessor: thread class path
= []
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0]
orderedgrouped.MergeManager: close onDiskFile. State: NumOnDiskFiles=1.
Current:
path=/Users/lbodor/apache/hive/itests/hive-unit/target/tmp/scratchdir/lbodor/_tez_session_dir/e01fa9d5-36d9-4449-bfa4-d12b5fa290f8/.tez/application_1599171197926_0001_wd/localmode-local-dir/output/attempt_1599171197926_0001_1_00_000000_0_10098/file.out,
len=26
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0] ShuffleScheduler.fetch:
Completed fetch for attempt: {0, 0,
attempt_1599171197926_0001_1_00_000000_0_10098} to DISK_DIRECT, csize=26,
dsize=22, EndTime=1599171199608, TimeTaken=1, Rate=0.02 MB/s
2020-09-03T15:13:19,608 INFO [Fetcher_O {Map_1} #0]
orderedgrouped.ShuffleScheduler: All inputs fetched for input vertex : Map 1
{code}
hanging run:
{code}
2020-09-04T02:12:16,392 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None,
ifileReadAhead: true
2020-09-04T02:12:16,392 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847,
maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10,
postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-04T02:12:16,394 INFO [I/O Setup 0 Start: {Map 1}]
orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map
1 with configuration: maxFetchFailuresBeforeReporting=5,
reportReadErrorImmediately=true, maxFailedUniqueFetches=1,
abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1,
hostFailureFraction=0.2, minFailurePerHost=4,
maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5,
minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-04T02:12:16,398 INFO [I/O Setup 0 Start: {Map 1}]
runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-04T02:12:16,398 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask:
AutoStartComplete
2020-09-04T02:12:16,398 INFO [TezChild] task.TaskRunner2Callable: Running
task, taskAttemptId=attempt_1599210735282_0001_1_01_000000_0
2020-09-04T02:12:16,416 INFO [TezChild] tez.ReduceRecordProcessor: Waiting for
ShuffleInputs to become ready
{code}
So in the hanging run we reach the point of ShuffleScheduler instantiation, but
we never reach this point:
{code}
2020-09-03T15:13:19,606 INFO [ShuffleAndMergeRunner {Map_1}]
orderedgrouped.MergeManager: Setting merger's parent thread to
ShuffleAndMergeRunner {Map_1}
{code}
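To double-check where the hanging attempt is stuck, the attached [^jstack.log] can be cross-referenced with these log lines: in the good run the "Setting merger's parent thread" message is emitted from the ShuffleAndMergeRunner thread, so it is worth checking whether that thread even exists in the hanging JVM. A minimal sketch of how such a dump can be taken and inspected (the jps filter and thread-name grep below are illustrative, not the exact commands used):
{code}
# find the forked test JVM (surefire fork); the filter is illustrative
jps -l | grep surefire

# capture a thread dump of that JVM, as attached to this issue as jstack.log
jstack <PID> > jstack.log

# if this returns nothing, the ShuffleAndMergeRunner thread for Map_1 was never
# started, which matches the missing "Setting merger's parent thread" message
grep -A 20 "ShuffleAndMergeRunner" jstack.log
{code}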
> TestMmCompactorOnTez hangs when running against Tez 0.10.0 staging artifact
> ---------------------------------------------------------------------------
>
> Key: HIVE-24111
> URL: https://issues.apache.org/jira/browse/HIVE-24111
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: TestCrudCompactorTez.log, jstack.log,
> org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log
>
>
> Reproduced the issue in a ptest run which I configured to run against the Tez
> staging artifacts
> (https://repository.apache.org/content/repositories/orgapachetez-1068/)
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we
> confirm whether it's a Hive or Tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97
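> For context, consuming the staging artifacts typically just means adding the
> staging repository to the build and bumping the Tez version property; a
> minimal, hypothetical sketch (the actual change is in the commit linked above):
> {code}
> <!-- hypothetical pom.xml snippet: resolve the Tez 0.10.0 candidate from the
>      staging repository; the repository id is illustrative -->
> <repositories>
>   <repository>
>     <id>orgapachetez-1068</id>
>     <url>https://repository.apache.org/content/repositories/orgapachetez-1068/</url>
>   </repository>
> </repositories>
> {code}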
--
This message was sent by Atlassian Jira
(v8.3.4#803005)