[jira] [Updated] (HIVE-3733) Improve Hive's logic for conditional merge

Pradeep Kamath (JIRA) Thu, 06 Dec 2012 18:39:24 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pradeep Kamath updated HIVE-3733:
---------------------------------

    Attachment: HIVE-3733.4.patch.txt

Changed the code which looks in the Operator Stack to look for 
ReduceSinkOperator instead of the exact CurrWork.getReducer() instance.

union19 no longer performs a conditional merge with this change. My hypothesis 
for this follows:

The union19 query is:
FROM (select 'tst1' as key, cast(count(1) as string) as value from src s1
UNION ALL
select s2.key as key, s2.value as value from src s2) unionsrc
INSERT OVERWRITE TABLE DEST1 SELECT unionsrc.key, count(unionsrc.value) group 
by unionsrc.key
INSERT OVERWRITE TABLE DEST2 SELECT unionsrc.key, unionsrc.value, 
unionsrc.value;

The from subquery has an implicit group by/ReduceSink due to the count. So 
though the second insert in the multi insert by itself does not have a 
groupby/ReduceSink, the subquery in the from clause causes the 
groupby/ReduceSink to appear in the stack and hence we decide not to do the 
conditional merge since the FileSink will be in the reduce.
                
> Improve Hive's logic for conditional merge
> ------------------------------------------
>
>                 Key: HIVE-3733
>                 URL: https://issues.apache.org/jira/browse/HIVE-3733
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: HIVE-3733.1.patch.txt, HIVE-3733.3.patch.txt, 
> HIVE-3733.4.patch.txt
>
>
> If the config hive.merge.mapfiles is set to true and hive.merge.mapredfiles 
> is set to false then when hive encounters a FileSinkOperator when generating 
> map reduce tasks, it will look at the entire job to see if it has a reducer, 
> if it does it will not merge. Instead it should be check if the 
> FileSinkOperator is a child of the reducer. This means that outputs generated 
> in the mapper will be merged, and outputs generated in the reducer will not 
> be, the intended effect of setting those configs.
> Simple repro:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=false;
> EXPLAIN
> FROM <input_table>
> INSERT OVERWRITE TABLE <output_table1> SELECT key, COUNT(*) group by key
> INSERT OVERWRITE TABLE <output_table2> SELECT *;
> The output should contain a Conditional Operator, Mapred Stages, and Move 
> tasks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3733) Improve Hive's logic for conditional merge

Reply via email to