[
https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571620#action_12571620
]
Olga Natkovich commented on PIG-116:
------------------------------------
The following config params in hadoop tell whether trash is enabled and where it lives:
<property>
  <name>fs.trash.root</name>
  <value>${hadoop.tmp.dir}/Trash</value>
  <description>The trash directory, used by FsShell's 'rm' command.
  </description>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes between trash checkpoints.
  If zero, the trash feature is disabled.
  </description>
</property>
Directories created under trash must be named using the timestamp format yyMMddHHmm.
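A minimal sketch of producing a directory name in that format with the JDK's SimpleDateFormat; the class and method names here are illustrative, not actual Hadoop or Pig identifiers:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TrashCheckpointName {
    // Trash checkpoint directories are named with the pattern yyMMddHHmm,
    // e.g. "0802221405" for 2008-02-22 14:05 (local time).
    static String checkpointName(Date when) {
        return new SimpleDateFormat("yyMMddHHmm").format(when);
    }

    public static void main(String[] args) {
        // Date(108, 1, 22, 14, 5) is 2008-02-22 14:05 in the deprecated ctor
        System.out.println(checkpointName(new Date(108, 1, 22, 14, 5)));
    }
}
```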
> pig leaves temp files behind
> ----------------------------
>
> Key: PIG-116
> URL: https://issues.apache.org/jira/browse/PIG-116
> Project: Pig
> Issue Type: Bug
> Reporter: Olga Natkovich
> Assignee: Olga Natkovich
>
> Currently, pig creates temp dirs via a call to FileLocalizer.getTemporaryPath.
> They are created on the client and are mainly used to store data between two
> M-R jobs. Pig then attempts to clean them up in the client's shutdown hook.
> The problem with this approach is that, because there is no way to order the
> shutdown hooks, in some cases the DFS is already closed when we try to
> delete the files, in which case a substantial amount of data can be left in
> DFS. I see this issue more frequently with hadoop 0.16, perhaps because I had
> to add an extra shutdown hook to handle hod disconnects.
> For the short term, I would like to propose the approach below:
> (1) If trash is configured on the cluster, use the trash location to create a
> temp directory that will expire in 7 days. The hope is that most jobs don't
> run longer than 7 days. The user can specify a longer interval via a command
> line switch.
> (2) If trash is not enabled on the cluster, we will continue to use the
> location that we use now.
> (3) In the shutdown hook, we will attempt to clean up. If the attempt fails
> and trash is enabled, we let trash handle it; otherwise we provide the user
> with the list of locations to clean. (I realize that this is not ideal, but I
> could not figure out a better way.)
> Longer term, I am talking with the hadoop team to have better temp file support:
> https://issues.apache.org/jira/browse/HADOOP-2815
> Comments? Suggestions?
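The selection logic in proposals (1) and (2) above could be sketched roughly as follows. All names here (chooseTempPath, trashIntervalMinutes, trashRoot, defaultTempPath, the "pig-tmp" suffix) are hypothetical, not actual Pig or Hadoop APIs; fs.trash.interval > 0 is the enablement check from the config above:

```java
public class TempPathChooser {
    // Hypothetical sketch: decide where to place intermediate data based on
    // whether trash is enabled (fs.trash.interval > 0).
    static String chooseTempPath(int trashIntervalMinutes,
                                 String trashRoot,
                                 String defaultTempPath) {
        if (trashIntervalMinutes > 0) {
            // Trash is enabled: place temp data under the trash root so that
            // anything the shutdown hook fails to delete is expired by the
            // trash emptier (the real dir would also need the yyMMddHHmm
            // timestamp naming for that to happen).
            return trashRoot + "/pig-tmp";
        }
        // Trash is disabled: fall back to the location used today.
        return defaultTempPath;
    }

    public static void main(String[] args) {
        System.out.println(chooseTempPath(0, "/user/olga/.Trash", "/tmp/pig"));
        System.out.println(chooseTempPath(60, "/user/olga/.Trash", "/tmp/pig"));
    }
}
```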