[jira] [Commented] (HIVE-17043) Remove non unique columns from group by keys if not referenced later

Vineet Garg (JIRA) Fri, 21 Sep 2018 13:33:07 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-17043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624133#comment-16624133
 ]


Vineet Garg commented on HIVE-17043:
------------------------------------

[~jcamachorodriguez] I agree it is ugly. The problem with 
{{RelMdColumnUniqueness}} is that it only tells you whether a given set of 
columns is unique. For this optimization we need to know the set of unique 
keys (if there are any for a given input). Therefore 
{{RelMdColumnUniqueness}} wouldn't really work here.
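
To make the distinction concrete, here is a minimal plain-Java sketch that does not use Calcite's API. The method names {{areColumnsUnique}} and {{uniqueKeys}} are hypothetical stand-ins for the two kinds of metadata query, and the declared key on {{id}} is an assumed example: the first query answers yes/no for a set you already have, while the second enumerates the keys themselves.

```java
import java.util.List;
import java.util.Set;

public class UniquenessSketch {
    // Assumption for illustration: the input declares PRIMARY KEY (id).
    static final List<Set<String>> DECLARED_KEYS = List.of(Set.of("id"));

    // Yes/no question: is this particular column set unique?
    // True when the set contains at least one declared key.
    static boolean areColumnsUnique(Set<String> columns) {
        return DECLARED_KEYS.stream().anyMatch(columns::containsAll);
    }

    // Enumeration question: which column sets are unique keys of the input?
    static List<Set<String>> uniqueKeys() {
        return DECLARED_KEYS;
    }

    public static void main(String[] args) {
        Set<String> groupBy = Set.of("id", "name", "city");
        // Tells us the group-by columns are unique, but not which of them
        // form the key, so we cannot tell which columns are safe to drop.
        System.out.println(areColumnsUnique(groupBy)); // true
        // Tells us the key is {id}, so name and city are candidates to drop.
        System.out.println(uniqueKeys());              // [[id]]
    }
}
```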

Another possible solution I could think of was calling {{getColumnOrigin}} on 
each group key to track lineage and build the set, then calling 
{{getTableOrigin}} to get to the base table, from which we can figure out the 
keys and remove the corresponding columns from the group sets. But this will 
be pretty expensive (calling {{getColumnOrigin}} on all the keys and then 
calling {{getTableOrigin}}).

I think we should keep {{RelMdUniqueKeys}} for determining unique keys based on 
the constraints; it seems like it is designed for this. We can write 
(preferably in a later patch) different logic/methods for {{getRowCount}} to 
use (which will be based on stats), since it only overrides project to 
determine uniqueness based on statistics.

Let me know what you think.



> Remove non unique columns from group by keys if not referenced later
> --------------------------------------------------------------------
>
>                 Key: HIVE-17043
>                 URL: https://issues.apache.org/jira/browse/HIVE-17043
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Logical Optimizer
>    Affects Versions: 3.0.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-17043.1.patch, HIVE-17043.2.patch, 
> HIVE-17043.3.patch, HIVE-17043.4.patch
>
>
> Group by keys may be a mix of unique (or primary) keys and regular columns. 
> In such cases the presence of regular columns won't alter the cardinality of 
> groups. So, if regular columns are not referenced later, they can be dropped 
> from group by keys. Depending on the operator tree, this may result in those 
> columns not being read from disk at all in the best case. In the worst case, 
> we will avoid shuffling and sorting regular columns from mapper to reducer, 
> which could still be substantial CPU and network savings.
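
The rewrite described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the logic in the attached patches: {{prune}}, its parameters, and the sample columns are all hypothetical. If the group-by columns contain a full unique key, every column outside that key is dropped unless something downstream still references it.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GroupByKeyPruning {
    // Returns the reduced group-by columns. Columns outside a contained
    // unique key cannot change the grouping, so they are kept only when
    // they are referenced later in the plan.
    static Set<String> prune(Set<String> groupBy,
                             List<Set<String>> uniqueKeys,
                             Set<String> referencedLater) {
        for (Set<String> key : uniqueKeys) {
            if (groupBy.containsAll(key)) {
                Set<String> kept = new LinkedHashSet<>();
                for (String col : groupBy) {
                    if (key.contains(col) || referencedLater.contains(col)) {
                        kept.add(col);
                    }
                }
                return kept;
            }
        }
        // No unique key among the group-by columns: nothing can be dropped.
        return groupBy;
    }

    public static void main(String[] args) {
        // Assumed example: GROUP BY id, name, city with PRIMARY KEY (id),
        // where only city is referenced after the aggregation.
        Set<String> groupBy = new LinkedHashSet<>(List.of("id", "name", "city"));
        Set<String> result = prune(groupBy, List.of(Set.of("id")), Set.of("city"));
        System.out.println(result); // [id, city]
    }
}
```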



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
