[ 
https://issues.apache.org/jira/browse/HIVE-28583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036524#comment-18036524
 ] 

yongzhi.shao edited comment on HIVE-28583 at 11/8/25 3:56 PM:
--------------------------------------------------------------

[~dkuzmenko] 

Sir, it doesn't seem to be that issue. I encountered this problem with both 
version 4.0.1 and downstream code built from a master branch close to 4.1.0. 
Whenever ZSTD is used, its high compression ratio makes some originally large 
tables look very small on disk, so they are mistakenly broadcast and the 
resulting hash table becomes too large.

 

I mentioned a similar issue in [HIVE-28979] ORC with ZSTD as default.
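As a possible mitigation while the size estimation is wrong (this is a sketch of a workaround, not a fix for the underlying bug; the property names below are standard Hive settings, and the threshold value shown is Hive's documented default, not a tuned recommendation):

{code:sql}
-- Disable automatic conversion to map join entirely:
SET hive.auto.convert.join=false;

-- Or keep auto conversion but lower the byte threshold under which the
-- combined small-table size is considered broadcastable. Note that when the
-- estimate is based on the compressed (ZSTD) on-disk size, this threshold is
-- effectively compared against post-compression bytes, which is why large
-- tables can slip under it. Default is 10000000 (10 MB):
SET hive.auto.convert.join.noconditionaltask.size=10000000;
{code}

Lowering the threshold only reduces the chance of a bad broadcast; disabling auto conversion forces the shuffle/SMJ path that succeeds in the scenario below.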



> In the case of subqueries, HIVE often incorrectly uses MAP-JOIN for large 
> tables.
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-28583
>                 URL: https://issues.apache.org/jira/browse/HIVE-28583
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.1
>            Reporter: yongzhi.shao
>            Priority: Major
>
> Hello, team.
>  
> We have found that in version 4.0.1, Hive on many occasions wrongly 
> estimates table sizes and thus incorrectly chooses MapJoin, ultimately 
> leading to a series of problems such as OOM / MapJoinMemoryExhaustionError.
>  
> In addition, the problem is more likely to occur with ORC tables that use 
> the ZSTD compression algorithm.
>  
> We found a typical scenario as follows:
> {code:java}
> -- dataset sizes
> -- big_table_1: ~1 TB
> -- select c1, c2, c3 from big_table_2: ~10-50 GB
>  
> -- uses map join and causes OOM / MapJoinMemoryExhaustionError
> select *
> from big_table_1 t1
> join (
>   select c1, c2, c3
>   from big_table_2
> ) t2 on xxxxx;
>  
> -- uses SMJ, no map join; job succeeds
> create table t2 as
> select c1, c2, c3
> from big_table_2;
>  
> select *
> from big_table_1 t1
> join t2 on xxxxx;
> {code}
> The above SQL executes normally in Hive 3.
> Can anyone guide me on how to deal with this kind of problem?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
