Sahil Takiar created HIVE-15114:
-----------------------------------

             Summary: Remove extra MoveTask operators
                 Key: HIVE-15114
                 URL: https://issues.apache.org/jira/browse/HIVE-15114
             Project: Hive
          Issue Type: Sub-task
          Components: Hive
    Affects Versions: 2.1.0
            Reporter: Sahil Takiar


When running simple insert queries (e.g. {{INSERT INTO TABLE ... VALUES ...}}) 
an extraneous {{MoveTask}} is created.

This is problematic when the scratch directory is on S3 since renames require 
copying the entire dataset.

For simple queries (like the one above), there are two MoveTasks. The first one 
moves the output data from one file in the scratch directory to another file in 
the scratch directory. The second MoveTask moves the data from the scratch 
directory to its final table location.

The first MoveTask should not be necessary. The goal of this JIRA is to remove 
it. This should help improve performance when running on S3.
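
To make the expected gain concrete, here is a rough cost sketch. It is a toy model and not Hive code: the class and method names are hypothetical, and it assumes both the scratch directory and the table location are on S3, so that every MoveTask turns into a full copy of the data.

```java
// Toy cost model, not Hive code: every name here is hypothetical.
final class MoveCostSketch {

    /**
     * Bytes copied when each MoveTask is a rename on a store where
     * a rename is implemented as a full copy (the S3 case described above).
     */
    static long bytesCopiedByRenames(long datasetBytes, int moveTasks) {
        if (datasetBytes < 0 || moveTasks < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact fails loudly on overflow instead of wrapping around.
        return Math.multiplyExact(datasetBytes, (long) moveTasks);
    }

    public static void main(String[] args) {
        long datasetBytes = 1024L * 1024 * 1024; // assumed 1 GiB of query output
        System.out.println("two MoveTasks copy " + bytesCopiedByRenames(datasetBytes, 2) + " bytes");
        System.out.println("one MoveTask copies " + bytesCopiedByRenames(datasetBytes, 1) + " bytes");
    }
}
```

Under these assumptions, dropping the scratch-to-scratch move halves the bytes copied for the query.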

It seems that the first Move might be caused by a dependency resolution problem 
in the optimizer, where a dependent task doesn't get properly removed when the 
task it depends on is filtered by a condition resolver.
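
The suspected failure mode can be sketched with a toy plan tree. This is not Hive's task model: the node names, the plan shape, and the {{pruneDependents}} flag are all assumptions made for illustration. It only shows how a dependent move survives when the task it hangs off is filtered but its subtree is not pruned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy plan model, not Hive's Task classes: every name here is hypothetical.
final class PlanPruneSketch {

    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node then(Node child) { children.add(child); return this; }
    }

    /** Assumed plan shape: write -> merge -> scratchMove, and write -> finalMove. */
    static Node samplePlan() {
        Node merge = new Node("merge").then(new Node("scratchMove"));
        return new Node("write").then(merge).then(new Node("finalMove"));
    }

    /** Names of the tasks that still run after the given tasks are filtered out. */
    static List<String> runnable(Node root, List<String> filtered, boolean pruneDependents) {
        List<String> out = new ArrayList<>();
        walk(root, filtered, pruneDependents, out);
        return out;
    }

    private static void walk(Node n, List<String> filtered, boolean pruneDependents, List<String> out) {
        boolean skipped = filtered.contains(n.name);
        if (skipped && pruneDependents) {
            return; // drop the filtered task together with everything that depends on it
        }
        if (!skipped) {
            out.add(n.name);
        }
        for (Node child : n.children) {
            walk(child, filtered, pruneDependents, out);
        }
    }

    public static void main(String[] args) {
        List<String> filtered = Arrays.asList("merge");
        System.out.println("without pruning: " + runnable(samplePlan(), filtered, false));
        System.out.println("with pruning:    " + runnable(samplePlan(), filtered, true));
    }
}
```

In this model the extra move disappears only when filtering a task also removes its dependents, which is the behavior the paragraph above suspects is missing.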

A dummy {{MoveTask}} is added in the 
{{GenMapRedUtils.createMRWorkForMergingFiles}} method. This method creates a 
conditional task which launches a job to merge the output files at the end of 
the job. At the end of the conditional job there is a MoveTask.

Even though Hive decides that the conditional merge job is not needed, it seems 
the MoveTask is still added to the plan.

It seems this extra {{MoveTask}} may have been added intentionally, though it 
is not yet clear why. The {{ConditionalResolverMergeFiles}} says that one of 
three tasks will be returned: move task only, merge task only, merge task 
followed by a move task.
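
The three outcomes can be pictured with a toy resolver. This is not the logic of {{ConditionalResolverMergeFiles}}: the size threshold, the per-directory input, and all names below are assumptions. It only illustrates why a move-only branch exists at all in a three-way choice.

```java
import java.util.Arrays;
import java.util.List;

// Toy resolver, not Hive's ConditionalResolverMergeFiles: names and rule are hypothetical.
final class MergeResolverSketch {

    enum Outcome { MOVE_ONLY, MERGE_ONLY, MERGE_THEN_MOVE }

    /**
     * Picks an outcome from per-directory average file sizes. A directory whose
     * average file size is below the threshold is treated as needing a merge.
     */
    static Outcome resolve(List<Long> avgFileSizes, long smallFileThreshold) {
        long toMerge = avgFileSizes.stream()
                .filter(size -> size < smallFileThreshold)
                .count();
        if (toMerge == 0) {
            return Outcome.MOVE_ONLY;       // nothing is small enough to merge
        }
        if (toMerge == avgFileSizes.size()) {
            return Outcome.MERGE_ONLY;      // every directory gets merged
        }
        return Outcome.MERGE_THEN_MOVE;     // mixed: merge some, move the rest
    }

    public static void main(String[] args) {
        long threshold = 16L * 1024 * 1024; // assumed 16 MiB cut-off
        System.out.println(resolve(Arrays.asList(64L << 20), threshold));
        System.out.println(resolve(Arrays.asList(1L << 20, 2L << 20), threshold));
        System.out.println(resolve(Arrays.asList(1L << 20, 64L << 20), threshold));
    }
}
```

In this model a simple insert that needs no merging falls into the move-only branch, which would be consistent with the extra move being there by design.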



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
