[jira] [Updated] (CALCITE-1069) Grouping ID implementation to support Hive

Hari Sankar Sivarama Subramaniyan (JIRA) Wed, 27 Jan 2016 12:54:50 -0800

     [ https://issues.apache.org/jira/browse/CALCITE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hari Sankar Sivarama Subramaniyan updated CALCITE-1069:
-------------------------------------------------------
    Description: 
Grouping sets are currently implemented in Calcite using a bit to indicate each
of the grouping columns. For instance, consider the following group by clause:

GROUP BY CUBE (a, b)

The generated Aggregate operator in Calcite will have a row schema consisting 
of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) is a boolean indicator 
field that records whether x is participating in the group by clause.

In contrast, Hive's implementation stores a single number corresponding to the 
GROUPING bit vector associated with a row (this is the result of the 
GROUPING_ID function in RDBMSs such as Microsoft SQL Server and Oracle). Thus, 
the row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].
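
To illustrate the difference between the two row schemas, here is a minimal 
Python sketch (not Calcite or Hive code; all function names are hypothetical). 
It assumes the Oracle / SQL Server convention, in which GROUPING(x) is 1 when x 
is rolled up (absent from the current grouping set) and 0 otherwise, and 
GROUPING_ID packs those flags into one integer with the first argument as the 
most significant bit. Hive's own bit order and polarity may differ from this.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every grouping set produced by CUBE(columns), largest first."""
    cols = list(columns)
    sets = []
    for size in range(len(cols), -1, -1):
        for combo in combinations(cols, size):
            sets.append(combo)
    return sets


def grouping_flags(columns, grouping_set):
    """One flag per column, as in the [a, b, GROUPING(a), GROUPING(b)] schema.

    Assumed convention: 1 when the column is rolled up (not in the grouping
    set), 0 when it is part of the grouping set.
    """
    return [0 if c in grouping_set else 1 for c in columns]


def grouping_id(flags):
    """Pack the per-column flags into a single number, as in the
    [a, b, GROUPING_ID(a,b)] schema. The first column is the most
    significant bit (assumed convention)."""
    gid = 0
    for flag in flags:
        gid = (gid << 1) | flag
    return gid


def cube_schema_rows(columns):
    """For each grouping set of CUBE(columns), return
    (grouping_set, per-column flags, packed grouping id)."""
    rows = []
    for gs in cube_grouping_sets(columns):
        flags = grouping_flags(columns, gs)
        rows.append((gs, flags, grouping_id(flags)))
    return rows


if __name__ == "__main__":
    for gs, flags, gid in cube_schema_rows(["a", "b"]):
        print(gs, flags, gid)
```

For CUBE (a, b) this yields the grouping sets (a, b), (a), (b), () with flags 
[0, 0], [0, 1], [1, 0], [1, 1] and GROUPING_ID values 0, 1, 2, 3 under the 
stated convention.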

This difference creates a mismatch between Calcite and Hive. As of now, we 
work around this mismatch on the Hive side: we create our own GROUPING_ID 
function applied over those columns. However, we have some issues related to 
predicate pushdown, constant propagation, the join project transpose rule 
(HIVE-12923), etc., that we need to keep solving as new rules are added to our 
optimizer. In short, this is making the code on the Hive side harder and harder 
to maintain.

This JIRA is intended to modify the implementation on the Calcite side so that 
we need not make workarounds/hacks in Hive to support grouping IDs.

  was:
Grouping sets are currently implemented in Calcite using a bit to indicate each
of the grouping columns. For instance, consider the following group by clause:

GROUP BY CUBE (a, b)

The generated Aggregate operator in Calcite will have a row schema consisting 
of [a, b, GROUPING(a), GROUPING(b)], where GROUPING( x ) is a boolean field 
indicator which represents whether x is participating
in the group by clause.

In contrast, Hive's implementation stores a single number corresponding to the
GROUPING bit vector associated with a row (this is the result of the 
GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the row 
schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].

This difference is creating a mismatch between Calcite and Hive. As of now, we 
work around this mismatch in the Hive side: we create our own GROUPING_ID 
function applied over those
columns. However, we have some issues related to predicates pushdown, constant 
propagation, join project transpose rule (HIVE-12923)
etc., that we need to continue solving as e.g. new rules are added to our 
optimizer. In short, this is making the code on the Hive side harder and harder 
to maintain. 

This jira is intended to modify the implementation on the
Calcite side to that we need not make workarounds/hacks in Hive to support 
Grouping IDs.


> Grouping ID implementation to support Hive
> ------------------------------------------
>
>                 Key: CALCITE-1069
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1069
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Hari Sankar Sivarama Subramaniyan
>            Assignee: Julian Hyde
>
> Grouping sets are currently implemented in Calcite using a bit to indicate 
> each of the grouping columns. For instance, consider the following group by 
> clause:
> GROUP BY CUBE (a, b)
> The generated Aggregate operator in Calcite will have a row schema consisting 
> of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) is a boolean indicator 
> field that records whether x is participating in the group by clause.
> In contrast, Hive's implementation stores a single number corresponding to 
> the GROUPING bit vector associated with a row (this is the result of the 
> GROUPING_ID function in RDBMSs such as Microsoft SQL Server and Oracle). Thus, 
> the row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].
> This difference creates a mismatch between Calcite and Hive. As of now, we 
> work around this mismatch on the Hive side: we create our own GROUPING_ID 
> function applied over those columns. However, we have some issues related to 
> predicate pushdown, constant propagation, the join project transpose rule 
> (HIVE-12923), etc., that we need to keep solving as new rules are added to 
> our optimizer. In short, this is making the code on the Hive side harder and 
> harder to maintain.
> This JIRA is intended to modify the implementation on the Calcite side so 
> that we need not make workarounds/hacks in Hive to support grouping IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
