[ https://issues.apache.org/jira/browse/HIVE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901637#action_12901637 ]
Namit Jain commented on HIVE-1018: ---------------------------------- I think a immediate use case is to convert the following job: select /*+ MAPJOIN(T2) */ count(1) from T1 join T2 on T1.x=T2.y; into a single MR job, if T2 is small. > pushing down group-by before joins > ---------------------------------- > > Key: HIVE-1018 > URL: https://issues.apache.org/jira/browse/HIVE-1018 > Project: Hadoop Hive > Issue Type: Improvement > Reporter: Ning Zhang > > Queries with both Group-by and Joins are very common and they are expensive > operations. By default Hive evalutes Joins first then group-by. Sometimes it > is possible to rewrite queries to apply group-by (or map-side partial group > by) first before join. This will remove a lot of duplicated keys in joins and > alleviate skewness in join keys for this case. This rewrite should be > cost-based. Before we have the stats and the CB framework, we can give users > hints to do the rewrite. > A particular case is where the join keys are the same as the grouping keys. > Or the group keys is a superset of the join keys (so that grouping won't > affect the result of joins). > Examples: > -- Q1 > select A.key, B.key > from A join B on (A.key=B.key) > group by A.key, B.key; > --Q2 > select distinct A.key, B.key > from A join B on (A.key=B.key); > --Q3, aggregation function is sum, count, min, max, (avg and median cannot be > handled). > selec A.key, sum(A.value), count(1), min(value), max(value) > from A left semi join B on (A.key=B.key) > group by A.key; > -- Q4. grouping keys is a superset of join keys > select distinct A.key, A.value > from A join B on (A.key=B.key) > In the case of join keys are not a subset of grouping keys, we can introduce > a map-side partial grouping operator with the keys of the UNION of the join > and grouping keys, to remove unnecessary duplications. This should be > cost-based though. > Any thoughts and suggestions? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.