[ 
https://issues.apache.org/jira/browse/HIVE-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated HIVE-16757:
--------------------------------
    Description: 
Calling Calcite's {{RelMetadataQuery.instance()}} is very expensive because it 
places a new memoization cache on the stack. Hidden in the deprecated 
{{AbstractRelNode.getRows()}} call is a call to {{instance()}}. In Hive we have 
a number of places where we call the deprecated {{getRows()}} instead of 
the new API {{estimateRowCount(RelMetadataQuery mq)}}, which accepts the 
RelMetadataQuery; in most of those places we already have one handy to pass. 
For a complex query (49 joins) there are 2995340 calls to 
{{AbstractRelNode.getRows}}, each one throwing the current memoization cache 
away.
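
As an illustration of the effect described above, here is a minimal, self-contained sketch. It does not use Calcite: {{Query}}, {{Node}}, the 0.1 selectivity and the left-deep plan shape are all hypothetical stand-ins, with {{getRows()}} modelled as "create a fresh cache, then estimate" and {{estimateRowCount(mq)}} as "reuse the caller's cache".

```java
import java.util.HashMap;
import java.util.Map;

final class RowCountCacheSketch {
    /** Counts how often the row-count computation actually runs (cache misses). */
    static int handlerCalls = 0;

    /** Hypothetical stand-in for a query object that owns one memoization cache. */
    static final class Query {
        private final Map<Node, Double> cache = new HashMap<>();

        double rowCount(Node node) {
            Double cached = cache.get(node);
            if (cached != null) {
                return cached;              // cache hit: no recomputation
            }
            handlerCalls++;                 // cache miss: compute, recursing into inputs
            double rows = node.isLeaf()
                    ? node.baseRows
                    : 0.1 * rowCount(node.left) * rowCount(node.right);
            cache.put(node, rows);
            return rows;
        }
    }

    /** Hypothetical plan node: a leaf with a base row count, or a join of two inputs. */
    static final class Node {
        final double baseRows;
        final Node left;
        final Node right;

        Node(double baseRows) { this.baseRows = baseRows; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.baseRows = 0; this.left = left; this.right = right; }

        boolean isLeaf() { return left == null; }

        /** Models the new-style call: the caller supplies the query (and its cache). */
        double estimateRowCount(Query mq) { return mq.rowCount(this); }

        /** Models the deprecated call: every invocation starts from an empty cache. */
        double getRows() { return estimateRowCount(new Query()); }
    }

    /** Builds a left-deep chain and returns its join nodes, innermost first. */
    static Node[] leftDeepJoins(int joins) {
        Node[] result = new Node[joins];
        Node current = new Node(10);
        for (int i = 0; i < joins; i++) {
            current = new Node(current, new Node(10));
            result[i] = current;
        }
        return result;
    }

    /** Asks every join for its row count through the deprecated-style call. */
    static int countFreshCachePerCall(int joins) {
        handlerCalls = 0;
        for (Node join : leftDeepJoins(joins)) {
            join.getRows();
        }
        return handlerCalls;
    }

    /** Asks every join for its row count while passing one shared query. */
    static int countSharedCache(int joins) {
        handlerCalls = 0;
        Query mq = new Query();
        for (Node join : leftDeepJoins(joins)) {
            join.estimateRowCount(mq);
        }
        return handlerCalls;
    }

    public static void main(String[] args) {
        System.out.println("fresh cache per call: " + countFreshCachePerCall(49));
        System.out.println("shared cache:         " + countSharedCache(49));
    }
}
```

With 49 joins this toy model performs 2499 computations when each call starts from a fresh cache and 99 when one cache is shared. These numbers are properties of the sketch only and are not measurements of Hive or Calcite.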


Was: -On complex queries HiveRelMdRowCount.getRowCount can get called many 
times. Since it does not memoize its result and the call is recursive, it 
results in an explosion of calls. For example, for a query with 49 joins, during 
join ordering (LoptOptimizeJoinRule) HiveRelMdRowCount.getRowCount gets 
called 6442 times as a top-level call, but the recursion exploded this to 501729 
calls. Memoization of the result would stop the recursion early. In my testing 
this reduced the join reordering time for said query from 11s to <1s.-

Note there is no need for {{HiveRelMdRowCount}} memoization because the 
function is called in stacks similar to this:
{code}
        at org.apache.hadoop.hive.ql.optimizer.calcite.stats.HiveRelMdRowCount.getRowCount(HiveRelMdRowCount.java:66)
        at GeneratedMetadataHandler_RowCount.getRowCount_$
        at GeneratedMetadataHandler_RowCount.getRowCount
        at org.apache.calcite.rel.metadata.RelMetadataQuery.getRowCount(RelMetadataQuery.java:204)
        at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.swapInputs(LoptOptimizeJoinRule.java:1865)
        at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.createJoinSubtree(LoptOptimizeJoinRule.java:1739)
{code}
and {{GeneratedMetadataHandler_RowCount.getRowCount}} handles memoization.
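
The layering in that stack can be sketched as follows. This is only a rough model under stated assumptions: {{Dispatcher}} plays the role of the memoizing generated wrapper and {{RowCountHandler}} the role of the non-memoizing handler; both names, the 0.5 selectivity and the cache layout are hypothetical and are not taken from Calcite's generated code.

```java
import java.util.HashMap;
import java.util.Map;

final class MemoizingDispatchSketch {
    /** Hypothetical plan node: a leaf with a base row count, or a join of two inputs. */
    static final class Node {
        final double baseRows;
        final Node left;
        final Node right;

        Node(double baseRows) { this.baseRows = baseRows; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.baseRows = 0; this.left = left; this.right = right; }

        boolean isLeaf() { return left == null; }
    }

    /** The handler computes a row count and keeps no cache of its own. */
    interface RowCountHandler {
        double getRowCount(Node node, Dispatcher dispatcher);
    }

    /** The wrapper checks its cache first and only then delegates to the handler. */
    static final class Dispatcher {
        private final Map<Node, Double> cache = new HashMap<>();
        private final RowCountHandler handler;
        int handlerInvocations = 0;

        Dispatcher(RowCountHandler handler) { this.handler = handler; }

        double getRowCount(Node node) {
            Double cached = cache.get(node);
            if (cached != null) {
                return cached;
            }
            handlerInvocations++;
            double rows = handler.getRowCount(node, this);
            cache.put(node, rows);
            return rows;
        }
    }

    /** Looks up the same join repeatedly and reports how often the handler ran. */
    static int invocationsForRepeatedLookups(int lookups) {
        Node join = new Node(new Node(100), new Node(20));
        // The handler recurses through the dispatcher, so inputs are memoized too.
        RowCountHandler handler = (node, d) -> node.isLeaf()
                ? node.baseRows
                : 0.5 * d.getRowCount(node.left) * d.getRowCount(node.right);
        Dispatcher dispatcher = new Dispatcher(handler);
        for (int i = 0; i < lookups; i++) {
            dispatcher.getRowCount(join);
        }
        return dispatcher.handlerInvocations;
    }

    public static void main(String[] args) {
        System.out.println("handler invocations: " + invocationsForRepeatedLookups(1000));
    }
}
```

In this model the handler runs once per node (three times for one join with two leaves) no matter how many lookups are made, which is why adding a second cache inside the handler would be redundant as long as all calls go through the same wrapper and cache.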

  was:
Calling Calcite's {{RelMetadataQuery.instance()}} is very expensive because it 
places a new memoization cache on the stack. Hidden in the deprecated 
{{AbstractRelNode.getRows()}} call is a call to {{instance()}}. In Hive we have 
a number of places where we call the deprecated {{getRows()}} instead of 
the new API {{estimateRowCount(RelMetadataQuery mq)}}, which accepts the 
RelMetadataQuery; in most of those places we already have one handy to pass. 
For a complex query (49 joins) there are 2995340 calls to 
{{AbstractRelNode.getRows}}, each one throwing the current memoization cache 
away.


Was: -On complex queries HiveRelMdRowCount.getRowCount can get called many 
times. Since it does not memoize its result and the call is recursive, it 
results in an explosion of calls. For example, for a query with 49 joins, during 
join ordering (LoptOptimizeJoinRule) HiveRelMdRowCount.getRowCount gets 
called 6442 times as a top-level call, but the recursion exploded this to 501729 
calls. Memoization of the result would stop the recursion early. In my testing 
this reduced the join reordering time for said query from 11s to <1s.-




> Use memoization in HiveRelMdRowCount.getRowCount
> ------------------------------------------------
>
>                 Key: HIVE-16757
>                 URL: https://issues.apache.org/jira/browse/HIVE-16757
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>         Attachments: HIVE-16757.01.patch, HIVE-16757.02.patch
>
>
> Calling Calcite's {{RelMetadataQuery.instance()}} is very expensive because 
> it places a new memoization cache on the stack. Hidden in the deprecated 
> {{AbstractRelNode.getRows()}} call is a call to {{instance()}}. In Hive we 
> have a number of places where we call the deprecated {{getRows()}} 
> instead of the new API {{estimateRowCount(RelMetadataQuery mq)}}, which 
> accepts the RelMetadataQuery; in most of those places we already have one 
> handy to pass. For a complex query (49 joins) there are 2995340 calls to 
> {{AbstractRelNode.getRows}}, each one throwing the current memoization cache 
> away.
> Was: -On complex queries HiveRelMdRowCount.getRowCount can get called many 
> times. Since it does not memoize its result and the call is recursive, it 
> results in an explosion of calls. For example, for a query with 49 joins, during 
> join ordering (LoptOptimizeJoinRule) HiveRelMdRowCount.getRowCount gets 
> called 6442 times as a top-level call, but the recursion exploded this to 501729 
> calls. Memoization of the result would stop the recursion early. In my 
> testing this reduced the join reordering time for said query from 11s to 
> <1s.-
> Note there is no need for {{HiveRelMdRowCount}} memoization because the 
> function is called in stacks similar to this:
> {code}
>       at org.apache.hadoop.hive.ql.optimizer.calcite.stats.HiveRelMdRowCount.getRowCount(HiveRelMdRowCount.java:66)
>       at GeneratedMetadataHandler_RowCount.getRowCount_$
>       at GeneratedMetadataHandler_RowCount.getRowCount
>       at org.apache.calcite.rel.metadata.RelMetadataQuery.getRowCount(RelMetadataQuery.java:204)
>       at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.swapInputs(LoptOptimizeJoinRule.java:1865)
>       at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.createJoinSubtree(LoptOptimizeJoinRule.java:1739)
> {code}
> and {{GeneratedMetadataHandler_RowCount.getRowCount}} handles memoization.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
