[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2014-04-23 Thread Kevin Wilfong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978736#comment-13978736
 ] 

Kevin Wilfong commented on HIVE-2621:
-

It's been a while, but the definition you posted looks correct.

 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
 Issue Type: New Feature
 Reporter: Kevin Wilfong
 Assignee: Kevin Wilfong
 Fix For: 0.9.0

 Attachments: ASF.LICENSE.NOT.GRANTED--HIVE-2621.D567.1.patch, 
 ASF.LICENSE.NOT.GRANTED--HIVE-2621.D567.2.patch, 
 ASF.LICENSE.NOT.GRANTED--HIVE-2621.D567.3.patch, 
 ASF.LICENSE.NOT.GRANTED--HIVE-2621.D567.4.patch, HIVE-2621.1.patch.txt


 Currently, when a user runs a query, such as a multi-insert, where each 
 insertion subclause consists of a simple query followed by a group by, the 
 group bys for each clause are run on a separate reducer.  This requires 
 writing the data for each group by clause to an intermediate file, and then 
 reading it back.  This uses a significant amount of the total CPU consumed by 
 the query for an otherwise simple query.
 If the subclauses are grouped by their distinct expressions and group by 
 keys, with all of the group by expressions for a group of subclauses run on a 
 single reducer, this would reduce the amount of reading/writing to 
 intermediate files for some queries.
 To do this, for each group of subclauses, in the mapper we would execute 
 the filters for each subclause 'or'd together (provided each subclause has a 
 filter) followed by a reduce sink.  In the reducer, the child operators would 
 be each subclause's filter followed by the group by and any subsequent 
 operations.
 Note that this would require turning off map aggregation, so we would need to 
 make using this type of plan configurable.
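
The plan described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not Hive's implementation: the rows, the clause names, the filters, and the helpers `map_phase`, `reduce_phase`, and `run_shared_reducer` are all hypothetical. The mapper applies the subclause filters OR'd together and does no partial aggregation; the reducer re-applies each subclause's own filter before its group by.

```python
from collections import defaultdict

# Hypothetical input rows, invented for illustration only.
ROWS = [
    {"key": "a", "value": 1},
    {"key": "a", "value": 5},
    {"key": "b", "value": 7},
    {"key": "b", "value": 2},
]

# Hypothetical subclauses that share the same group-by key ("key").
# Each one has its own filter and its own aggregation.
SUBCLAUSES = {
    "insclause-0": {"filter": lambda r: r["value"] > 4, "agg": sum},
    "insclause-1": {"filter": lambda r: r["value"] < 2, "agg": max},
}


def map_phase(rows, subclauses):
    """Emit (key, row) for rows that pass at least one subclause filter.

    This is the OR of all filters followed by a "reduce sink"; no
    map-side aggregation is performed.
    """
    for row in rows:
        if any(sc["filter"](row) for sc in subclauses.values()):
            yield row["key"], row


def reduce_phase(shuffled, subclauses):
    """Re-apply each subclause's filter, then aggregate per group-by key."""
    results = {name: {} for name in subclauses}
    for key, rows in shuffled.items():
        for name, sc in subclauses.items():
            values = [r["value"] for r in rows if sc["filter"](r)]
            if values:
                results[name][key] = sc["agg"](values)
    return results


def run_shared_reducer(rows, subclauses):
    """Run both phases, shuffling the mapper output by group-by key."""
    shuffled = defaultdict(list)
    for key, row in map_phase(rows, subclauses):
        shuffled[key].append(row)
    return reduce_phase(shuffled, subclauses)
```

In this sketch the input is read and shuffled once for both subclauses, which is the saving the description is after. The row `{"key": "b", "value": 2}` fails both filters, so it is dropped in the mapper and never reaches the reducer.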



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2014-04-23 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978907#comment-13978907
 ] 

Lefty Leverenz commented on HIVE-2621:
--

Thanks Kevin.



[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2014-04-22 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976525#comment-13976525
 ] 

Lefty Leverenz commented on HIVE-2621:
--

Asking again:  Is the definition of *hive.multigroupby.singlereducer* correct 
or was it just held over from *hive.multigroupby.singlemr*?  (See previous 
comment.)



[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2014-04-06 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961641#comment-13961641
 ] 

Lefty Leverenz commented on HIVE-2621:
--

This jira removed the configuration property *hive.multigroupby.singlemr* 
(HIVE-2056) and added *hive.multigroupby.singlereducer*.



[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2014-04-06 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961645#comment-13961645
 ] 

Lefty Leverenz commented on HIVE-2621:
--

I added *hive.multigroupby.singlereducer* to the Configuration Properties 
wikidoc.  It has the same definition as the defunct 
*hive.multigroupby.singlemr* -- is that correct?

bq.  Whether to optimize multi group by query to generate a single M/R  job 
plan. If the multi group by query has common group by keys, it will be 
optimized to generate a single M/R job.

* [Configuration Properties:  hive.multigroupby.singlemr 
|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.multigroupby.singlemr]
* [Configuration Properties:  hive.multigroupby.singlereducer 
|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.multigroupby.singlereducer]



[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2013-01-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548328#comment-13548328
 ] 

Hudson commented on HIVE-2621:
--

Integrated in Hive-trunk-hadoop2 #54 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/54/])
HIVE-2621:Allow multiple group bys with the same input data and spray keys 
to be run on the same reducer. (Kevin via He Yongqiang) (Revision 1226903)

 Result = ABORTED
heyongqiang : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1226903
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/groupby10.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_map.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_map_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_noskew.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby8.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby9.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/multigroupby_singlemr.q
* /hive/trunk/ql/src/test/results/clientpositive/groupby10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby7_map_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby9.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/multi_insert.q.out
* /hive/trunk/ql/src/test/results/clientpositive/multigroupby_singlemr.q.out
* /hive/trunk/ql/src/test/results/clientpositive/parallel.q.out



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2012-01-03 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179136#comment-13179136
 ] 

Hudson commented on HIVE-2621:
--

Integrated in Hive-trunk-h0.21 #1182 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/1182/])
HIVE-2621:Allow multiple group bys with the same input data and spray keys 
to be run on the same reducer. (Kevin via He Yongqiang)

heyongqiang : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1226903
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/groupby10.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_map.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_map_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_noskew.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby8.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby9.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/multigroupby_singlemr.q
* /hive/trunk/ql/src/test/results/clientpositive/groupby10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby7_map_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby9.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/multi_insert.q.out
* /hive/trunk/ql/src/test/results/clientpositive/multigroupby_singlemr.q.out
* /hive/trunk/ql/src/test/results/clientpositive/parallel.q.out


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
 Issue Type: New Feature
 Reporter: Kevin Wilfong
 Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch, HIVE-2621.D567.4.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-24 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175775#comment-13175775
 ] 

Phabricator commented on HIVE-2621:
---

heyongqiang has accepted the revision HIVE-2621 [jira] Allow multiple group 
bys with the same input data and spray keys to be run on the same reducer..

REVISION DETAIL
  https://reviews.facebook.net/D567






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Namit Jain (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174891#comment-13174891
 ] 

Namit Jain commented on HIVE-2621:
--

Let me take a look at the code again.

But the general flow should be as follows:

if hive.multigroupby.singlereducer is true (which it should always be),
  find common distincts
  (or the hive.multigroupby.singlereducer check can be done inside the find 
  common distincts function itself)
  if common distincts == null
    old (current) approach - map side aggr should be used
  else:
    new code path

What do you think? That way, we are guaranteed that the existing behavior is 
not changed.
This new parameter only affects distincts, and it is very easy to turn it 
off.

I know the code is kind of messy here, but can you spend some time to 
modularize it and reuse as much as possible?



 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
 Issue Type: New Feature
 Reporter: Kevin Wilfong
 Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174922#comment-13174922
 ] 

Phabricator commented on HIVE-2621:
---

njain has commented on the revision HIVE-2621 [jira] Allow multiple group bys 
with the same input data and spray keys to be run on the same reducer..

  otherwise it looks good

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:5925 can 
you give an example in the comments ?

  Sorry, but it is not clear to me.

  Do you want to return 2 lists - one for the common distincts ?
  I am missing something: what else do you want to return ?


  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3469 thanks 
for creating this function and re-using this code

REVISION DETAIL
  https://reviews.facebook.net/D567






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174927#comment-13174927
 ] 

Phabricator commented on HIVE-2621:
---

kevinwilfong has commented on the revision HIVE-2621 [jira] Allow multiple 
group bys with the same input data and spray keys to be run on the same 
reducer..

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:5925 I 
return a list of lists of clause names (which map to subqueries) where the 
queries mapped to by each clause name in a list all have the same distinct and 
group by keys.

  It doesn't return common distincts.

  I'll try to make the comment clearer.
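
  The return value described here (a list of lists of clause names) could look
roughly like the following Python sketch. It is an illustration under
assumptions, not the code in SemanticAnalyzer.java: the function name
`group_clauses_by_keys`, the `distinct` and `group_by` fields, and the choice
to compare both as unordered sets are all hypothetical.

```python
def group_clauses_by_keys(clauses):
    """Group clause names whose distinct and group-by keys are the same.

    `clauses` maps a clause name to a dict with the hypothetical fields
    "distinct" and "group_by", each a list of expression strings.
    Returns a list of lists of clause names, in first-seen order.
    """
    groups = {}
    for name, info in clauses.items():
        # Assumption: expression order does not matter, so compare as sets.
        signature = (frozenset(info["distinct"]), frozenset(info["group_by"]))
        groups.setdefault(signature, []).append(name)
    return list(groups.values())


# Example: the first two clauses share their keys, the third does not.
example = {
    "insclause-0": {"distinct": ["c1"], "group_by": ["k"]},
    "insclause-1": {"distinct": ["c1"], "group_by": ["k"]},
    "insclause-2": {"distinct": [], "group_by": ["k2"]},
}
grouped = group_clauses_by_keys(example)
```

  Each inner list is a candidate for sharing one reducer; a list with a single
clause name has nothing to share with.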

REVISION DETAIL
  https://reviews.facebook.net/D567






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Kevin Wilfong (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174945#comment-13174945
 ] 

Kevin Wilfong commented on HIVE-2621:
-

There are currently two ways of getting common distincts, the current way 
checks that all distinct expressions in the subqueries are the same.  My new 
code doesn't depend on this, it tries to construct subsets of the subqueries 
such that this is true for each subset.

The advantage of doing it in the form
if (optimizeMultiGroupBy) {
  ...
} else {
  group queries by common distinct and group by expressions
  for each group:
if (size of group  1  etc.) {
  new code
} else {
  old code
}
}

is that the block of code inside the optimizeMultiGroupBy if statement can 
produce 2 map reduce jobs where the new code might produce many.

After looking at it more carefully, I can get rid of the singlemrMultiGroupBy 
if statement and the code within its block, because it produces the same result 
that my new code would, except that the new code can also handle filters.

After removing that code, the only remaining code above the if statement will 
be the poorly named getCommonDistinctExprs (as it only returns the common 
distinct expressions provided a lot of conditions are met including a 
requirement that all the distinct expressions are common), which I should be 
able to modify to use my new code.
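
The grouping step in the else branch above could be sketched in Java roughly
as follows. This is an illustration under stated assumptions, not the code in
the patch: the Subquery class, its fields, and both method names are
hypothetical, expressions are modeled as plain strings, and only the
"size of group > 1" part of the condition is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GroupSubqueriesSketch {
    /** Hypothetical stand-in for one insert subclause of a multi-insert query. */
    public static final class Subquery {
        final String dest;
        final Set<String> distinctExprs;
        final List<String> groupByKeys;

        public Subquery(String dest, Set<String> distinctExprs, List<String> groupByKeys) {
            this.dest = dest;
            this.distinctExprs = distinctExprs;
            this.groupByKeys = groupByKeys;
        }
    }

    /** Buckets subqueries that share both distinct expressions and group by keys. */
    public static Collection<List<Subquery>> groupByCommonExprs(List<Subquery> subqueries) {
        // Set and List compare by content, so the pair works as a map key.
        Map<List<Object>, List<Subquery>> groups = new LinkedHashMap<>();
        for (Subquery sq : subqueries) {
            List<Object> key = Arrays.<Object>asList(sq.distinctExprs, sq.groupByKeys);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(sq);
        }
        return groups.values();
    }

    /** Only the size check is modeled; the real condition has further parts. */
    public static boolean usesSharedReducer(List<Subquery> group) {
        return group.size() > 1;
    }
}
```

Groups for which usesSharedReducer returns true would take the new code path,
and singleton groups would fall back to the old one.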

 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175201#comment-13175201
 ] 

Phabricator commented on HIVE-2621:
---

heyongqiang has commented on the revision HIVE-2621 [jira] Allow multiple 
group bys with the same input data and spray keys to be run on the same 
reducer..

  can u add a testcase which includes a subquery in one group by clause?
   still reviewing


INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6290 the 
check here is confusing. It is not very clear which cases should come in, and 
which cases should not. Can u try to reduce the size of check list by moving 
some up?

  or add some comments here

REVISION DETAIL
  https://reviews.facebook.net/D567


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175205#comment-13175205
 ] 

Phabricator commented on HIVE-2621:
---

heyongqiang has commented on the revision HIVE-2621 [jira] Allow multiple 
group bys with the same input data and spray keys to be run on the same 
reducer..

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6284 can u 
add here comments that you just explained to me offline?

REVISION DETAIL
  https://reviews.facebook.net/D567


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-22 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175209#comment-13175209
 ] 

Phabricator commented on HIVE-2621:
---

heyongqiang has commented on the revision HIVE-2621 [jira] Allow multiple 
group bys with the same input data and spray keys to be run on the same 
reducer..

  otherwise, looks good to me.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3015 
Instead of a lot of duplicate code here, can we just pass one dest to 
genGroupByPlanReduceSinkOperator()?

REVISION DETAIL
  https://reviews.facebook.net/D567


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-21 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174671#comment-13174671
 ] 

Phabricator commented on HIVE-2621:
---

njain has commented on the revision HIVE-2621 [jira] Allow multiple group bys 
with the same input data and spray keys to be run on the same reducer..

INLINE COMMENTS
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12 
This does not look right.

  We would like to make hive.multigroupby.singlereducer as true by default.

  But we are unnecessarily generating 3 MR jobs for this query (with no 
distinct). I think we can get it in 2 MR jobs today (not 100% sure)
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6273 It 
would be good to merge the code path with the above if block

  (optimizeMultiGroupBy).

  The common distinct expression should return the common distinct
  checking for the parameter HIVEMULTIGROUPBYSINGLEREDUCER.

  Or, it might be simpler to remove the above if block (the 
optimizeMultiGroupBy should be covered by this block).
  Anyway, the above if block (6253-6272) seems broken
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6211 I 
think this code can be simplified.

  The function getCommonDistinctExprs can be removed

REVISION DETAIL
  https://reviews.facebook.net/D567


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-20 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173907#comment-13173907
 ] 

Phabricator commented on HIVE-2621:
---

njain has commented on the revision HIVE-2621 [jira] Allow multiple group bys 
with the same input data and spray keys to be run on the same reducer..

INLINE COMMENTS
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12 
A general comment for all the tests.

  Instead of loading 500 rows to src - create a new table - load some 10 rows 
of src into it, and use that for all tests.

  The test outputs are really long, and difficult to review

REVISION DETAIL
  https://reviews.facebook.net/D567


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
 HIVE-2621.D567.2.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-01 Thread Kevin Wilfong (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13161369#comment-13161369
 ] 

Kevin Wilfong commented on HIVE-2621:
-

diff is here https://reviews.facebook.net/D567

 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch






[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

2011-12-01 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13161371#comment-13161371
 ] 

Phabricator commented on HIVE-2621:
---

kevinwilfong has commented on the revision HIVE-2621 [jira] Allow multiple 
group bys with the same input data and spray keys to be run on the same 
reducer..

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6296 The 
code in this method should be the same as what followed the code to generate a 
group by plan in the existing code.  The diff just didn't seem to match them up.
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:5693 The 
string was not being used in this method.

REVISION DETAIL
  https://reviews.facebook.net/D567


 Allow multiple group bys with the same input data and spray keys to be run on 
 the same reducer.
 ---

 Key: HIVE-2621
 URL: https://issues.apache.org/jira/browse/HIVE-2621
 Project: Hive
  Issue Type: New Feature
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
 Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch

