[ 
https://issues.apache.org/jira/browse/HIVE-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Martinez Moller updated HIVE-15026:
------------------------------------------
    Attachment: explain.txt
                testcase.txt
                testcase.png

> Option to not merge the views
> -----------------------------
>
>                 Key: HIVE-15026
>                 URL: https://issues.apache.org/jira/browse/HIVE-15026
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer
>            Reporter: Carlos Martinez Moller
>         Attachments: explain.txt, testcase.png, testcase.txt
>
>
> Note: I am trying to simplify a real case scenario we are having and 
> simplifying the queries for the example. Hope they make sense and that the 
> proposal I am doing can be understood. The real query is a lot more complex 
> and long.
> When performing a query of this type:
> ------------------------------------------------------------------
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TABLE_A
> WHERE COLUMNA=1 AND COLUMND='Case 1'
> UNION ALL
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TABLE_A
> WHERE COLUMNA=10 AND COLUMNE='Case 2'
> ------------------------------------------------------------------
> This creates Three Stages. First Stage is FULL SCAN of TABLE_A + Filter 
> (COLUMNA=1/COLUMND='Case 1'),  Second Stage is FULL SCAN of TABLE_A again + 
> Filter (COLUMNA=10/COLUMNE='Case 2'), and third stage is the UNION ALL.
> TABLE_A has 2TB data of information.
> But COLUMNA=1 and COLUMNA=10 filter all together only 2GB of information.
> So I thought to use:
> --------------------------------------------------------------
> WITH TEMP_VIEW AS
> (SELECT COLUMNA,COLUMNB,COLUMNC,COLUMND
> FROM TABLE_A
> WHERE COLUMNA=1 AND COLUMNA=10)
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TEMP_VIEW
> WHERE COLUMNA=1 AND COLUMND='Case 1'
> UNION ALL
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TEMP_VIEW
> WHERE COLUMNA=10 AND COLUMNE='Case 2'
> ---------------------------------------------------------------
> I thought that with this it would create 4 Stages:
> - Stage 1: Full Scan of TABLE_A and generate intermediate data
> - Stage 2: In the data of Stage 1 Filter (COLUMNA=1/COLUMND='Case 1')
> - Stage 3: In the data of Stage 1 Filter (COLUMNA=10/COLUMNE='Case 2')
> - Stage 4: Union ALL
> With this instead of 4TB being read from disk, only 2TB+4GB (twice going 
> through the view) would be read (In our case complexity is even bigger and we 
> will be saving 20TB reading)
> But it does the same than in the original query. It internally pushes the 
> predicates of the "WITH" query in the two parts of the UNION.
> It would be good to have a control on this, or for the optimizer to choose 
> the best approach using histogram/statistics information.
> For those knowing Oracle RDBMS this is equivalent to the MERGE/NO_MERGE and 
> NEST behaviour:
> http://www.dba-oracle.com/t_hint_no_merge.htm as an explanation...
> Other approaches for my example could apply, as partitioning by COLUMNA of 
> BUCKETING. But are not applicable in our case as COLUMNA is not commonly used 
> when accessing this table.
> The point of this JIRA is to add a functionality similar to the one of Oracle 
> (not Merging the query, but generating an in-memory/disk temporary view) both 
> for "WITH" clauses and VIEWS.
> This is very very commonly used in Data Ware Houses managing big amounts of 
> data and provides big performance benefits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to