[ https://issues.apache.org/jira/browse/HIVE-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated HIVE-3733:
---------------------------------
    Attachment: HIVE-3733.5.patch.txt

I have attached HIVE-3733.5.patch.txt for review (also added to Differential at https://reviews.facebook.net/D6969). It includes some changes, but essentially implements the fix for this issue at the physical optimizer level. The code checks whether a non-reduce FileSinkOperator in a MapRedTask (one that is not a child of a ConditionTask, so that we don't go after merge tasks) can be conditionally merged, and uses the code from GenMRFileSink1 to actually introduce the conditional merge.

All tests pass except the two below:

testCliDriver_stats19 - This succeeds on my Mac but fails on a Linux machine; I am not sure what to make of it.

testNegativeCliDriver_stats_aggregator_error_1 - This produces an error during execution. I am assuming this test case is known to be flaky and that the error is not due to the current changes.

Committers, please review carefully to make sure I haven't missed any corner cases and that I have left the tasks/plan in a valid state.

> Improve Hive's logic for conditional merge
> ------------------------------------------
>
>                 Key: HIVE-3733
>                 URL: https://issues.apache.org/jira/browse/HIVE-3733
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: HIVE-3733.1.patch.txt, HIVE-3733.3.patch.txt,
> HIVE-3733.4.patch.txt, HIVE-3733.5.patch.txt, HIVE-3733.optimizer.patch.txt
>
>
> If the config hive.merge.mapfiles is set to true and hive.merge.mapredfiles
> is set to false, then when Hive encounters a FileSinkOperator while generating
> map reduce tasks, it looks at the entire job to see whether it has a reducer;
> if it does, Hive does not merge. Instead, it should check whether the
> FileSinkOperator is a child of the reducer. This means that outputs generated
> in the mapper will be merged and outputs generated in the reducer will not
> be, which is the intended effect of setting those configs.
> Simple repro:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=false;
> EXPLAIN
> FROM <input_table>
> INSERT OVERWRITE TABLE <output_table1> SELECT key, COUNT(*) group by key
> INSERT OVERWRITE TABLE <output_table2> SELECT *;
> The output should contain a Conditional Operator, Mapred Stages, and Move
> tasks
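
For illustration only, the per-operator check described in the issue might be sketched as below. This is a simplified model, not Hive's code: `Op`, `runs_in_reducer`, and `should_merge` are hypothetical names, and the sketch assumes that a file sink runs on the reduce side exactly when a reduce sink ("RS") appears among its ancestors in the operator tree. The actual patch works on MapRedTask plans and reuses GenMRFileSink1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """Hypothetical stand-in for an operator in the plan (not Hive's Operator class)."""
    kind: str                       # e.g. "TS", "SEL", "GBY", "RS", "FS"
    parent: Optional["Op"] = None   # single-parent chain is enough for this sketch


def runs_in_reducer(op: Op) -> bool:
    """Return True if some ancestor of `op` is a reduce sink ("RS")."""
    cur = op.parent
    while cur is not None:
        if cur.kind == "RS":
            return True
        cur = cur.parent
    return False


def should_merge(file_sink: Op, merge_mapfiles: bool, merge_mapredfiles: bool) -> bool:
    """Decide per file sink, rather than per job, which merge setting applies."""
    if runs_in_reducer(file_sink):
        return merge_mapredfiles    # output written by the reducer
    return merge_mapfiles           # output written by the mapper


# Plan shaped like the repro: one table scan feeding two inserts.
scan = Op("TS")
rs = Op("RS", parent=scan)
gby = Op("GBY", parent=rs)
fs_reduce = Op("FS", parent=gby)    # <output_table1>: GROUP BY, written in the reducer
sel = Op("SEL", parent=scan)
fs_map = Op("FS", parent=sel)       # <output_table2>: SELECT *, written in the mapper

# With hive.merge.mapfiles=true and hive.merge.mapredfiles=false:
print("merge output_table2:", should_merge(fs_map, True, False))
print("merge output_table1:", should_merge(fs_reduce, True, False))
```

Under these assumptions, only the map-side sink for <output_table2> is selected for a conditional merge, while the job-level check described in the issue would skip both sinks because the job contains a reducer.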