[jira] Commented: (PIG-129) need to create temp files in the task's working directory

Pi Song (JIRA) Sun, 02 Mar 2008 05:01:57 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574215#action_12574215
 ]


Pi Song commented on PIG-129:
-----------------------------

Olga,
I want to clarify what I think, and *I really need your opinion* on this. 
Temp file creation due to DataBag spills can happen in 2 places:
- In the Hadoop Map Reduce execution engine
- In the local execution engine

I agree with you that the working dir mechanism in Hadoop is already good and 
you're trying to adopt it, *BUT* what about the local execution engine? 

Even though most people pay more attention to the Hadoop backend, which is 
where Pig started, the local engine still has its uses.

A sample use case: I have a big data file on my hard disk (so it cannot be too 
big), and I just download Pig and quickly write a Pig script to process it on 
my local machine using the local execution engine (without running Hadoop).

A good local engine implementation will help improve usability of Pig!!!

Can we handle this issue in 2 different ways? One for the Hadoop backend, one 
for the local engine. I'm willing to implement what I proposed in the last 
comment for the local engine.

> need to create temp files in the task's working directory
> ---------------------------------------------------------
>
>                 Key: PIG-129
>                 URL: https://issues.apache.org/jira/browse/PIG-129
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Amir Youssefi
>
> Currently, pig creates temp data such as spilled bags in the directory 
> specified by java.io.tmpdir. The problem is that this directory is usually 
> shared by all tasks and can easily run out of space.
> A better approach would be to create these files in the temp dir inside of the 
> task's working directory, as these locations usually have much more space and 
> can be hosted on different disks, so the performance could be better.
> There are 2 parts to this fix:
> (1) In org.apache.pig.data.DataBag, check whether the temp directory exists 
> and create it if not before trying to create the temp file. This is somewhere 
> around line 390 in the code.
> (2) Change mapred.child.java.opts in hadoop-site.xml to include a new value 
> for the tmpdir property that points to ./tmp. For instance: 
> <property>
>         <name>mapred.child.java.opts</name>
>         <value>-Xmx1024M -Djava.io.tmpdir="./tmp"</value>
>         <description>arguments passed to child jvms</description>
> </property>
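
Part (1) of the fix quoted above could look roughly like the following. This is 
only a minimal sketch: the class SpillFiles, the method createSpillFile and the 
"pigbag" file prefix are hypothetical names chosen for illustration, not the 
actual code in org.apache.pig.data.DataBag. It assumes the spill location is 
read from java.io.tmpdir, as the issue description says.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for illustration only; not Pig's actual DataBag code.
class SpillFiles {

    // Creates a temp file in tmpDir, creating the directory first if needed.
    static File createSpillFile(File tmpDir) throws IOException {
        // mkdirs() returns false when the directory already exists (for
        // example, created concurrently by another task), so check
        // isDirectory() again before giving up.
        if (!tmpDir.isDirectory() && !tmpDir.mkdirs() && !tmpDir.isDirectory()) {
            throw new IOException("Cannot create temp directory: " + tmpDir);
        }
        File spill = File.createTempFile("pigbag", ".tmp", tmpDir);
        spill.deleteOnExit(); // best-effort cleanup when the JVM exits
        return spill;
    }

    // Convenience overload: uses whatever java.io.tmpdir points to, e.g.
    // ./tmp when set through mapred.child.java.opts as in part (2).
    static File createSpillFile() throws IOException {
        return createSpillFile(new File(System.getProperty("java.io.tmpdir")));
    }
}
```

Because the directory is passed in as a parameter, the same helper could in 
principle be reused by the local execution engine with a different location.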

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
