[
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025977#comment-15025977
]
Ashutosh Chauhan commented on HIVE-12491:
-----------------------------------------
I guess what Gopal is pointing out is that handling for the multiple-PK case
is missing, which might help this use case (as demonstrated in his WIP patch).
The other thing is that we failed to recognize that, out of the 3 columns, two
are different UDFs on the same column, so we computed the denominator
incorrectly. Ideally we need to fix both, but doing at least one of the two
will help.
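To illustrate the second point, here is a minimal sketch (not the WIP patch,
and not Hive's actual code) that collapses NDVs coming from the same source
column before applying the same back-off as getEasedOutDenominator. The class
name, the dedupeBySourceColumn helper, and the choice of keeping the largest
NDV per source column are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EasedDenominatorSketch {

    // Same exponential back-off as in the issue description:
    // denominator = NDV1 * NDV2^(1/2) * NDV3^(1/4) * ..., NDVs sorted descending.
    static long easedOutDenominator(List<Long> distinctVals) {
        List<Long> sorted = new ArrayList<>(distinctVals); // do not mutate the caller's list
        Collections.sort(sorted, Collections.reverseOrder());
        long denom = sorted.get(0);
        for (int i = 1; i < sorted.size(); i++) {
            denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
        }
        return denom;
    }

    // Hypothetical helper: keep a single NDV (the largest) per source column, so
    // that year(c) and month(c) do not both contribute the NDV of column c.
    static List<Long> dedupeBySourceColumn(String[] sourceColumns, long[] ndvs) {
        Map<String, Long> maxNdvPerColumn = new LinkedHashMap<>();
        for (int i = 0; i < ndvs.length; i++) {
            maxNdvPerColumn.merge(sourceColumns[i], ndvs[i], Long::max);
        }
        return new ArrayList<>(maxNdvPerColumn.values());
    }

    public static void main(String[] args) {
        // Join keys from the report: _col0, year(_col2), month(_col2).
        String[] sourceColumns = {"_col0", "_col2", "_col2"};
        long[] ndvs = {8007986L, 821974390L, 821974390L};

        long naive = easedOutDenominator(Arrays.asList(8007986L, 821974390L, 821974390L));
        long deduped = easedOutDenominator(dedupeBySourceColumn(sourceColumns, ndvs));

        System.out.println("naive denominator:   " + naive);
        System.out.println("deduped denominator: " + deduped);
    }
}
```

Keeping the largest NDV per source column is only one possible policy. A real
fix would also have to decide how to treat UDF outputs whose true NDV is much
smaller than that of the underlying column.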
> Column Statistics: 3 attribute join on a 2-source table is off
> --------------------------------------------------------------
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.3.0, 2.0.0
> Reporter: Gopal V
> Assignee: Prasanth Jayachandran
> Attachments: HIVE-12491.WIP.patch
>
>
> The eased out denominator has to detect duplicate row-stats from different
> attributes.
> {code}
> private Long getEasedOutDenominator(List<Long> distinctVals) {
>   // Exponential back-off for NDVs.
>   // 1) Descending order sort of NDVs
>   // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
>   Collections.sort(distinctVals, Collections.reverseOrder());
>   long denom = distinctVals.get(0);
>   for (int i = 1; i < distinctVals.size(); i++) {
>     denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
>   }
>   return denom;
> }
> {code}
> This gets {{[8007986, 821974390, 821974390]}} as input, which is actually the
> NDVs of 3 join-key columns, 2 of which are derived from the same column.
> {code}
> Reduce Output Operator (RS_12)
>   key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>   sort order: +++
>   Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>   value expressions: _col1 (type: bigint)
> Join Operator (JOIN_13)
>   condition map:
>     Inner Join 0 to 1
>   keys:
>     0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
>     1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>   outputColumnNames: _col3
> {code}
> So the eased-out denominator is off by a factor of 30,000 or so, which
> causes OOMs in map-joins.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)