remove name node calls in hive by creating temporary directories
----------------------------------------------------------------

                 Key: HIVE-2201
                 URL: https://issues.apache.org/jira/browse/HIVE-2201
             Project: Hive
          Issue Type: Improvement
            Reporter: Namit Jain


Currently, in Hive, when a file gets written by a FileSinkOperator,
the sequence of operations is as follows:

1. In tmp directory tmp1, create a tmp file _tmp_1
2. At the end of the operator, move
/tmp1/_tmp_1 to /tmp1/1
3. Move directory /tmp1 to /tmp2
4. For all files in /tmp2, remove the files whose names start with
_tmp, then remove any duplicate files.
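
The cleanup in step 4 can be sketched roughly as follows. This is only
an illustration, not Hive's actual code: the function name
cleanup_current and the "<task>_<attempt>" file naming are assumptions
made here so that duplicates produced by speculative attempts of the
same task can be told apart.

```python
def cleanup_current(names):
    """Sketch of step 4: drop _tmp files, then drop duplicate outputs."""
    kept = {}
    for name in sorted(names):
        if name.startswith("_tmp"):
            continue  # leftover of an attempt that never finished
        # Assumption: finished files are named "<task>_<attempt>".
        task_id = name.split("_")[0]
        # Keep the first attempt seen for each task; later ones are duplicates.
        kept.setdefault(task_id, name)
    return sorted(kept.values())


# Hypothetical listing of /tmp2 after speculative execution.
listing = ["_tmp_1_0", "1_0", "1_1", "_tmp_2_0", "2_0"]
print(cleanup_current(listing))  # ['1_0', '2_0']
```

Every entry in the listing, including each _tmp file, has to be
examined, which is where the extra name node traffic comes from.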

Due to speculative execution, a lot of temporary files are created
in /tmp1 (or /tmp2). This leads to a lot of name node calls,
especially for large queries.

The protocol above can be modified slightly:

1. In tmp directory tmp1, create a tmp file _tmp_1
2. At the end of the operator, move
/tmp1/_tmp_1 to /tmp2/1
3. Move directory /tmp2 to /tmp3
4. For all files in /tmp3, remove all duplicate files.
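
A minimal simulation of the modified protocol, using the local file
system as a stand-in for HDFS. The helper names run_attempt and commit
and the "<task>_<attempt>" file naming are hypothetical, introduced
only for this sketch; Hive's real implementation may differ.

```python
import os
import tempfile


def run_attempt(root, task, attempt, finish=True):
    """Steps 1-2: write _tmp_<task>_<attempt> in tmp1, then move it to tmp2."""
    tmp1 = os.path.join(root, "tmp1")
    tmp2 = os.path.join(root, "tmp2")
    os.makedirs(tmp1, exist_ok=True)
    os.makedirs(tmp2, exist_ok=True)
    name = f"{task}_{attempt}"  # assumed naming scheme
    scratch = os.path.join(tmp1, "_tmp_" + name)
    with open(scratch, "w") as out:
        out.write("rows\n")
    if finish:  # a killed attempt never performs the move
        os.rename(scratch, os.path.join(tmp2, name))


def commit(root):
    """Steps 3-4: move tmp2 to tmp3, then remove duplicate files only."""
    tmp3 = os.path.join(root, "tmp3")
    os.rename(os.path.join(root, "tmp2"), tmp3)
    seen = set()
    for name in sorted(os.listdir(tmp3)):
        task = name.split("_")[0]
        if task in seen:
            # Duplicate output from a speculative attempt of the same task.
            os.remove(os.path.join(tmp3, name))
        seen.add(task)
    return sorted(os.listdir(tmp3))


with tempfile.TemporaryDirectory() as root:
    run_attempt(root, 1, 0)
    run_attempt(root, 1, 1)                # speculative duplicate
    run_attempt(root, 2, 0, finish=False)  # killed attempt
    run_attempt(root, 2, 1)
    final_files = commit(root)
    leftover = sorted(os.listdir(os.path.join(root, "tmp1")))

print(final_files)  # ['1_0', '2_1']
print(leftover)     # ['_tmp_2_0']
```

In this sketch the unfinished _tmp file stays behind in tmp1, so the
directory that is moved and scanned contains only completed files.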

Since only completed files are moved into /tmp2, this should reduce
the number of tmp files that the final cleanup has to handle.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
