pig leaves temp files behind
----------------------------

                 Key: PIG-116
                 URL: https://issues.apache.org/jira/browse/PIG-116
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich
            Assignee: Olga Natkovich


Currently, Pig creates temp dirs via a call to FileLocalizer.getTemporaryPath. 
They are created on the client and are mainly used to store data between two 
M-R jobs. Pig then attempts to clean them up in the client's shutdown hook. 
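To make the failure mode concrete, here is a minimal sketch of the current pattern 
(this is not the actual FileLocalizer code; class and field names are illustrative): 
temp paths are recorded as they are handed out, and a JVM shutdown hook tries to 
remove them when the client exits.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TempFileTracker {
    private static final List<Path> tempPaths = new ArrayList<Path>();

    static {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                try {
                    // If the DFS client has already been closed by its own
                    // shutdown hook, this delete fails and the data is orphaned.
                    FileSystem fs = FileSystem.get(new Configuration());
                    for (Path p : tempPaths) {
                        fs.delete(p, true);
                    }
                } catch (Exception e) {
                    // Nothing we can do this late in shutdown.
                }
            }
        });
    }

    // Hand out a unique temp path and remember it for cleanup.
    public static synchronized Path getTemporaryPath(Path base) {
        Path p = new Path(base, "tmp" + System.currentTimeMillis());
        tempPaths.add(p);
        return p;
    }
}

The ordering problem described below comes from the fact that this hook and the 
DFS client's own shutdown hook run in an unspecified order.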

The problem with this approach is that there is no way to order the shutdown 
hooks, so in some cases the DFS is already closed when we try to delete the 
files, and a substantial amount of data can be left in DFS. I see this issue 
more frequently with Hadoop 0.16, perhaps because I had to add an extra 
shutdown hook to handle HOD disconnects.

In the short term, I would like to propose the approach below (a rough sketch 
follows the list):

(1) If trash is configured on the cluster, use the trash location to create a 
temp directory that will expire in 7 days. The hope is that most jobs don't run 
longer than 7 days. The user can specify a longer interval via a command-line 
switch.
(2) If trash is not enabled on the cluster, the location we use now will 
continue to be used.
(3) In the shutdown hook, we will attempt to clean up. If the attempt fails and 
trash is enabled, we let trash handle it; otherwise we provide the list of 
locations to the user to clean. (I realize this is not ideal, but I could not 
figure out a better way.)
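A hedged sketch of (1)-(3) follows; the trash path, the config key, and the 
class/method names are illustrative assumptions, not the actual Pig or Hadoop 
API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TempLocationChooser {
    // Hypothetical locations for illustration only.
    private static final Path TRASH_DIR = new Path("/user/pig/.Trash/Current");
    private static final Path DEFAULT_TMP = new Path("/tmp");

    static boolean trashEnabled(Configuration conf) {
        // Assumption: trash is considered active when the trash interval is > 0.
        return conf.getLong("fs.trash.interval", 0) > 0;
    }

    // (1)/(2): put temp data under trash if available, so it expires on its own;
    // otherwise fall back to the location used today.
    static Path chooseTempBase(Configuration conf) {
        return trashEnabled(conf) ? TRASH_DIR : DEFAULT_TMP;
    }

    // (3): best-effort cleanup from the shutdown hook.
    static void cleanup(Configuration conf, Iterable<Path> tempPaths) {
        try {
            FileSystem fs = FileSystem.get(conf);
            for (Path p : tempPaths) {
                fs.delete(p, true);
            }
        } catch (Exception e) {
            if (!trashEnabled(conf)) {
                // Trash will not reclaim these, so tell the user what to remove.
                System.err.println("Could not remove temp dirs; please delete: "
                        + tempPaths);
            }
            // If trash is enabled, expiration will eventually take care of them.
        }
    }
}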

Longer term, I am talking with the Hadoop team about better temp file support: 
https://issues.apache.org/jira/browse/HADOOP-2815

Comments? Suggestions?
