Chengxiang Li created HIVE-10550:
------------------------------------

             Summary: Dynamic RDD caching optimization for HoS.[Spark Branch]
                 Key: HIVE-10550
                 URL: https://issues.apache.org/jira/browse/HIVE-10550
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Chengxiang Li


A Hive query may scan the same table multiple times, for example in a 
self-join or self-union, or when several parts of the query share the same 
subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]
 is an example. Spark supports caching RDD data: it keeps the computed RDD 
data in memory and reads it directly from memory the next time it is needed. 
This avoids the cost of recomputing the RDD (and all of its dependencies) at 
the cost of higher memory usage. By analyzing the query context, we should be 
able to determine which parts of a query can be shared, so that the generated 
Spark job can reuse the cached RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
