[
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888667#comment-17888667
]
yongzhi.shao edited comment on TEZ-4542 at 10/11/24 2:13 PM:
-------------------------------------------------------------
[~abstractdog] [~zhengchenyu] [~glapark] [~yigress]
Hello.
We are now using Tez 0.10.4 heavily.
We have observed that after this PR was added, PipelinedSorter.merger.futures[]
can keep accumulating elements, and this List<Future> alone
can consume more than 8GB of memory (the ArrayList can hold tens of
millions of elements). This leads to many OOM job failures and
many slow queries.
Rolling back TEZ-4542 resolved the issue for us. Given the short time since
the release of tez-0.10.4, I suggest that the community roll back this
PR for the time being; the issue may need to be discussed in more
detail by everyone before it is finalised.
was (Author: lisoda):
[~abstractdog] [~zhengchenyu] [~glapark] [~yigress]
Hello.
We are now using Tez 0.10.4 heavily.
We have observed that after this PR was added, PipelinedSorter.merger.futures[]
can keep accumulating elements, and this List<Future> alone
can consume more than 8GB of memory (the ArrayList can hold tens of
millions of elements). This leads to many OOM job failures and
many slow queries.
Rolling back TEZ-4542 resolved the issue for us. Given the short time since
the release of tez-0.10.4, I suggest that the community roll back this
PR for the time being; the issue may need to be discussed in more
detail by everyone before it is finalised. !screenshot-1.png!
> Tez application may fail due to int overflow when record size is large and
> sort memory is low.
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-4542
> URL: https://issues.apache.org/jira/browse/TEZ-4542
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.2
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Fix For: 0.10.4
>
> Attachments: screenshot-1.png
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> The Tez application failed, and we found this error stack:
> {code:java}
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
> ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException
> at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
> at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
> at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
> at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
> at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
> ... 19 more
> Caused by: java.lang.IllegalArgumentException
> at java.nio.Buffer.position(Buffer.java:244)
> at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
> at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
> at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
> at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
> at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
> at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
> at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
> ... 28 more
> {code}
> After adding a debug log, the problem is easy to find: the variable
> `dataSize` in {{PipelinedSorter::SortSpan}} overflows.
> The problem is triggered when the following two conditions are met at the
> same time:
> * The vertex has too many inputs/outputs, so the memory allocated to each
> I/O for sorting is too small.
> * The average record size is larger than 2K. In that case `dataSize` in
> {{PipelinedSorter::SortSpan}} overflows, the sorter does not try to
> allocate less meta space, and an exception is raised.
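
The overflow described above can be illustrated with a small, self-contained sketch. It is not the PipelinedSorter code: the class name, the helper methods, and the record counts are hypothetical and chosen only to show the mechanism. A size computed in 32-bit `int` arithmetic wraps to a negative value once the true product exceeds Integer.MAX_VALUE, and java.nio.Buffer.position rejects a negative position with IllegalArgumentException, which matches the innermost frame of the stack trace.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; names and numbers are assumptions,
// not taken from the Tez source.
public class DataSizeOverflowSketch {

    // Multiplies in 32-bit int arithmetic, which wraps silently on overflow.
    public static int sizeAsInt(int records, int bytesPerRecord) {
        return records * bytesPerRecord;
    }

    // Same computation widened to 64 bits before multiplying, so it cannot
    // wrap for any pair of int inputs.
    public static long sizeAsLong(int records, int bytesPerRecord) {
        return (long) records * bytesPerRecord;
    }

    // Returns true if the buffer refuses the requested position.
    public static boolean positionRejected(ByteBuffer buf, int offset) {
        try {
            buf.position(offset);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int records = 1_100_000;   // hypothetical record count
        int bytesPerRecord = 2048; // roughly the 2K record size mentioned above

        int wrapped = sizeAsInt(records, bytesPerRecord);
        long exact = sizeAsLong(records, bytesPerRecord);

        System.out.println("int result:  " + wrapped);
        System.out.println("long result: " + exact);

        // A wrapped (negative) size used as a buffer position is rejected.
        ByteBuffer buf = ByteBuffer.allocate(1024);
        System.out.println("position rejected: " + positionRejected(buf, wrapped));
    }
}
```

Widening to `long` before multiplying, or using Math.multiplyExact to fail fast, are two common ways to guard this kind of computation; whether either matches the actual fix in TEZ-4542 is not stated in this message.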