[ 
https://issues.apache.org/jira/browse/PIG-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582696#comment-13582696
 ] 

Cheolsoo Park commented on PIG-3169:
------------------------------------

{quote}
they are going back to read intermediate data from the first job and failing 
because it has been deleted after the previous store
{quote}
Please correct me if I am wrong, but I have a slightly different understanding. 
If you print out the path that causes the failure in the test, it is not the 
intermediate output file from a previous job but the input file. Here is what I 
did to verify it:
{code}
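// Resolve the test's input file to a full URI and print it for comparison with the error
String fn = Util.generateURI(tmpFile.toString(), pig.getPigContext());
System.out.println("blah: "+ fn);
pig.registerQuery("A = LOAD '" + fn + "';");
pig.registerQuery("Split A into A1 if $0<=10, A2 if $0>10;");
// 1st job: store A1 into a temporary path
pig.registerQuery("Store A1 into '" + 
FileLocalizer.getTemporaryPath(pigContext) + "';");
// 2nd job: iterating over A2 has to read the same input A again
pig.openIterator("A2");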
Now here is the test log:
{code}
blah: hdfs://localhost.localdomain:45150/tmp/temp2039910329/tmp-2001898725
...
ERROR 2118: Input path does not exist: 
hdfs://localhost.localdomain:45150/tmp/temp2039910329/tmp-2001898725
{code}
As can be seen, the input file (/tmp/temp2039910329/tmp-2001898725) has been 
deleted after the 1st job, and that makes the 2nd job fail. In fact, I can 
confirm this by running explain on A1 and A2 as well. For example, if I do the 
following in Pig,
{code}
A = LOAD '1.txt';
Split A into A1 if $0<=10, A2 if $0>10;
Explain A1;
Explain A2;
{code}
It gives me the following output, and I can see both A1 and A2 share the same 
input: 
{code}
A1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-7
|
|---A1: Filter[bag] - scope-2
    |   |
    |   Less Than or Equal[boolean] - scope-6
    |   |
    |   |---Cast[int] - scope-4
    |   |   |
    |   |   |---Project[bytearray][0] - scope-3
    |   |
    |   |---Constant(10) - scope-5
    |
    |---A: Load(file:///home/cheolsoo/workspace/pig-git-2/1.txt:org.apache.pig.builtin.PigStorage) - scope-0--------

...

A2: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-19
|
|---A2: Filter[bag] - scope-14
    |   |
    |   Greater Than[boolean] - scope-18
    |   |
    |   |---Cast[int] - scope-16
    |   |   |
    |   |   |---Project[bytearray][0] - scope-15
    |   |
    |   |---Constant(10) - scope-17
    |
    |---A: Load(file:///home/cheolsoo/workspace/pig-git-2/1.txt:org.apache.pig.builtin.PigStorage) - scope-12--------
{code}
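To make the failure mode concrete, here is a rough sketch in plain Java. It is 
not Pig code: the class and method names (SharedInputDemo, runJob, 
secondJobSucceeds) are made up for illustration, and local files stand in for 
HDFS paths. It only models two jobs that read the same input, with a cleanup 
step between them that deletes that input:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class SharedInputDemo {

    // Hypothetical stand-in for one job: read the shared input and keep the
    // rows on one side of the split ($0 <= 10 or $0 > 10).
    static List<String> runJob(Path input, boolean keepSmall) throws IOException {
        return Files.readAllLines(input).stream()
                .filter(line -> (Integer.parseInt(line.trim()) <= 10) == keepSmall)
                .collect(Collectors.toList());
    }

    // Runs the A1 job, optionally deletes the input (modeling an over-eager
    // temp-file cleanup), then runs the A2 job against the same input.
    static boolean secondJobSucceeds(boolean deleteTempFilesAfterFirstJob) {
        try {
            Path tempDir = Files.createTempDirectory("pig-temp-demo");
            // The input itself lives under the temp dir, as in the test above.
            Path input = Files.createTempFile(tempDir, "tmp-", ".txt");
            Files.write(input, List.of("5", "10", "15", "20"));

            runJob(input, true); // 1st job: A1
            if (deleteTempFilesAfterFirstJob) {
                Files.delete(input); // assumption: cleanup treats the input as disposable
            }
            try {
                runJob(input, false); // 2nd job: A2 shares the same Load
                return true;
            } catch (NoSuchFileException e) {
                return false; // analogous to "Input path does not exist"
            } finally {
                Files.deleteIfExists(input);
                Files.deleteIfExists(tempDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cleanup after 1st job: " + secondJobSucceeds(true));
        System.out.println("no cleanup:            " + secondJobSucceeds(false));
    }
}
```

In this sketch, secondJobSucceeds(true) returns false because the second read 
hits a missing file, while secondJobSucceeds(false) returns true. That matches 
what the test log and the explain output suggest is happening.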
Sorry for the long message, but I wanted to make sure that we are on the same 
page before deciding how to fix this.
                
> Remove temporary files that are not needed
> ------------------------------------------
>
>                 Key: PIG-3169
>                 URL: https://issues.apache.org/jira/browse/PIG-3169
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Mark Wagner
>            Assignee: Mark Wagner
>            Priority: Minor
>             Fix For: 0.12
>
>         Attachments: PIG-3169.1.patch, PIG-3169-hotfix.patch
>
>
> When using Grunt, intermediate data and distributed cache files are left in 
> 'pig.temp.dir' until the session is closed. It would be nice to clean up files 
> as soon as they are no longer needed.

