[
https://issues.apache.org/jira/browse/HIVE-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309791#comment-16309791
]
wan kun commented on HIVE-18362:
--------------------------------
Hi [~gopalv],
What we do is similar, but the implementations differ. My implementation takes
the table's or partition's ROW_COUNT directly from the Hive metastore, so no
additional computation is needed.
I also have a few questions to ask:
1. Why do you use NDV instead of ROW_COUNT directly? I think the NDV will be
lower than the actual row count, while the memory used is roughly linear in the
number of rows (a small sketch follows this list).
2. I'm sorry, I haven't had a Hive 2.* test environment for a while. Hive
branch-2.* depends on ColStatistics. Can you tell me where ColStatistics comes
from? Is it necessary to run extra calculations for the additional column
statistics before our job?
3. The checkNumberOfEntriesForHashTable function only checks the number of
entries of one RS at a time. Can it happen that several map-join tables are
loaded into memory together, resulting in an OOM?
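To make questions 1 and 3 concrete, here is a minimal sketch of the two sizing
approaches as I understand them. The helper names are mine, and the "numRows"
parameter key is an assumption about how the metastore publishes row counts, so
please read this as illustration rather than the actual patch code.
{code:java}
import java.util.Map;

// Illustration only; these helpers are hypothetical, not existing Hive methods.
public class MapJoinSizingSketch {

    // My approach: reuse the row count the metastore already stores for the
    // table or partition (assumed to be under the "numRows" parameter),
    // so no extra scan or computation is needed.
    static long rowCountFromMetastore(Map<String, String> tableParameters) {
        String numRows = tableParameters.get("numRows");
        return numRows == null ? -1L : Long.parseLong(numRows);
    }

    // The branch-2.* approach as I understand it: estimate the hash-table
    // entries from column statistics such as the NDV of the join keys and
    // compare them with a budget. Since NDV <= row count, this can
    // under-estimate memory, which grows roughly linearly with the rows.
    static boolean fitsInHashTable(long estimatedEntries, long maxEntries) {
        return estimatedEntries >= 0 && estimatedEntries <= maxEntries;
    }
}
{code}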
There are also two follow-up questions:
1. Is the ConvertJoinMapJoin optimization only used in TezCompiler? Spark uses
SparkMapJoinOptimizer. Is there no such optimizer for MapReduce?
2. Hive branch-1.2 does not have this part of the code (the parameter is listed
in hive-default.xml.template, but it should have no effect there).
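For reference, this is roughly how I expect the proposed parameter to be
consumed during the map-join conversion decision. Only the parameter name
hive.auto.convert.join.max.number and its default of 2500000 come from this
issue; shouldConvertToMapJoin is a hypothetical helper, not a method in any
existing optimizer.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Rough sketch of the proposed guard; not actual Hive code.
public class MapJoinRowLimitSketch {
    static final String MAX_ROWS_KEY = "hive.auto.convert.join.max.number";
    static final long MAX_ROWS_DEFAULT = 2500000L;

    // Skip the map-join conversion when the small table has more rows than the limit.
    static boolean shouldConvertToMapJoin(Configuration conf, long smallTableRowCount) {
        long maxRows = conf.getLong(MAX_ROWS_KEY, MAX_ROWS_DEFAULT);
        return smallTableRowCount >= 0 && smallTableRowCount <= maxRows;
    }
}
{code}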
> Introduce a parameter to control the max row number for map join conversion
> ---------------------------------------------------------------------------
>
> Key: HIVE-18362
> URL: https://issues.apache.org/jira/browse/HIVE-18362
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: wan kun
> Assignee: Gopal V
> Priority: Minor
> Attachments: HIVE-18362-branch-1.2.patch
>
>
> The compression ratio of an ORC compressed file can be very high in some
> cases.
> The test table has three int columns and twelve million records, but the
> compressed file size is only 4M. Hive automatically converts the join to a
> map join, but this causes a memory overflow. So I think it is better to have
> a parameter that limits the total number of table records in the map join
> conversion: if the total number of records is larger than that, the join
> cannot be converted to a map join.
> *hive.auto.convert.join.max.number = 2500000L*
> The default value for this parameter is 2500000, because that many records
> occupy about 700M of memory in the client JVM, and 2500000 records is already
> a large table for a map join.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)