[
https://issues.apache.org/jira/browse/HADOOP-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566261#action_12566261
]
Allen Wittenauer commented on HADOOP-2735:
------------------------------------------
For the most part, I agree with Pi's comments.
Koji and I just had a quick discussion about this and I think we've come up
with a good idea. Now we want to toss it to the wolves. :)
Quick summary of the issue as I understand it:
1) We have applications that depend on the java.io.tmpdir property being set.
2) These applications may be independently/inadvertently writing data to the
same place. If this data is large, the disk may fill up. On UNIX, this can
have dire consequences (/tmp may be on / or backed by swap).
3) Hard coding is generally bad, as it makes assumptions about task behavior
and file system layout. In particular, ./tmp is bad because it assumes that
the task hasn't changed its cwd.
So this is what we propose:
We create a new Hadoop property called mapred.child.tmp. This property takes
one of three kinds of value:
default == we leave java.io.tmpdir alone
dynamic == we dynamically calculate the full path of our mapred task
directory's tmp dir (the end result would be the equivalent of ./tmp, except
that instead of depending upon '.', it would be the actual path that mapred
normally cwd's to: mapred.local.dir/blah/blah/blah/.../tmp)
anything else == a path provided by the user
With this type of change, we can cover a wide variety of cases: applications
that assume java.io.tmpdir is the same across all tasks, applications that
require a separate java.io.tmpdir for each task, and ops wanting to
'spread the load' across multiple drives.
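
To make the three cases concrete, here is a rough Java sketch of how the
setting could be resolved. This is only an illustration of the proposal, not
the attached patch: the class ChildTmpResolver, the method resolve, and the
taskWorkDir parameter are all made-up names, and how the value would actually
be passed to the child JVM is an assumption.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical helper illustrating the proposal; not part of Hadoop. */
final class ChildTmpResolver {

    /**
     * Maps a mapred.child.tmp value to the directory java.io.tmpdir should
     * point at. An empty result means "leave java.io.tmpdir alone".
     *
     * @param setting     value of mapred.child.tmp (may be null if unset)
     * @param taskWorkDir absolute path of the task's working directory
     *                    (assumed to be known to the caller)
     */
    static Optional<String> resolve(String setting, String taskWorkDir) {
        if (setting == null || setting.equals("default")) {
            return Optional.empty();               // default: do not touch it
        }
        if (setting.equals("dynamic")) {
            // Same place as ./tmp, but spelled out as a full path so it does
            // not depend on the task's current working directory.
            return Optional.of(new File(taskWorkDir, "tmp").getAbsolutePath());
        }
        return Optional.of(setting);               // anything else: user path
    }

    public static void main(String[] args) {
        String workDir = args.length > 0 ? args[0] : System.getProperty("user.dir");
        for (String s : new String[] {"default", "dynamic", "/grid/0/tmp"}) {
            String opt = resolve(s, workDir)
                    .map(p -> "-Djava.io.tmpdir=" + p)
                    .orElse("(java.io.tmpdir unchanged)");
            System.out.println(s + " -> " + opt);
        }
    }
}
```

The point of returning "nothing" for default is that existing jobs keep their
current behavior unless someone opts in.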
Thoughts?
> Setting default tmp directory for java createTempFile (java.io.tmpdir)
> ----------------------------------------------------------------------
>
> Key: HADOOP-2735
> URL: https://issues.apache.org/jira/browse/HADOOP-2735
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Koji Noguchi
> Assignee: Amareshwari Sri Ramadasu
> Priority: Critical
> Fix For: 0.16.1
>
> Attachments: patch-2735.txt
>
>
> On our cluster, we've seen Pig (http://incubator.apache.org/pig/) fill up
> /tmp and fail.
> (This is also inefficient, since all the local tasks were spilling to the same disk.)
> Pig simply uses the Java API createTempFile:
> http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File
> Can we add -Djava.io.tmpdir="./tmp" somewhere, so that
> 1) Tasks can utilize all disks when using tmp
> 2) Any undeleted tmp files will be deleted by the tasktracker when the task (job?)
> is done.
> The easiest way is to set it inside mapred.child.java.opts in the config, but
> this can be overwritten if users set their own task heap size.