[ 
https://issues.apache.org/jira/browse/PIG-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637663#comment-13637663
 ] 

Cheolsoo Park commented on PIG-3169:
------------------------------------

[~mwagner], thank you very much for updating the patch. I have a few comments:
# Since you're deleting temp files after each job in MRLauncher, Main and 
PigServer no longer need to call {{FileLocalizer.deleteTempFiles()}} before 
exit, do they? You mentioned "leaving other temporary files alone" in your 
comment. I guess you mean temporary files that the user specified as output. 
For example,
{code}
Store A1 into '" + FileLocalizer.getTemporaryPath(pigContext) + "';"
{code}
These temporary files should be deleted by the user (or unit tests) explicitly. 
Are there any other temporary files that Pig needs to clean up? If not, would 
you like to remove {{FileLocalizer.deleteTempFiles()}} from Main and PigServer 
since they're redundant?
# In addition, can you think about how to update FileLocalizer? Currently, 
FileLocalizer maintains a queue to keep track of temp files generated by 
{{FileLocalizer.getTemporaryPath()}}. When {{FileLocalizer.deleteTempFiles()}} 
is called, they will be poped off and deleted. With your patch, this queue is 
going to be out-of-sync since {{FileLocalizer.delete()}} does not update the 
queue. So when {{FileLocalizer.deleteTempFiles()}} is called, quite a few 
IOExceptions will be caught in the following code for non-existent files:
{code}
public static void deleteTempFiles() {
    while (!toDelete().isEmpty()) {
        try {
            ElementDescriptor elem = toDelete().pop();
            elem.delete();
        } 
        catch (IOException e) {
            log.error(e);
        }
    }
    setInitialized(false);
}
{code}
{{elem.delete()}} is essentially a RPC call to NN, so I find it wasteful to 
make such a call when we already know the file doesn't exist. In fact, I 
thought about whether we can completely eliminate this queue. But I think it's 
still convenient to have it because unit tests use it clean up multiple temp 
files. Probably, we replace the queue with a map and make 
{{FileLocalizer.delete()}} update it while deleting files? Thoughts?
# Regarding MRIntermediateDataVisitor, why not delete temp files when it's 
visiting stores instead of returning the list of files and running an extra 
iteration to delete them? Wouldn't the former be more efficient?
# If we do 3, can we call the visitor StoreCleaner instead of StoreFinder? 
You're using both names now. It would be good to use a single name.
{code}
+        if (mr.mapPlan != null && mr.mapPlan.size() > 0) {
+            StoreFinder finder = new StoreFinder(mr.mapPlan);
+            finder.visit();
+        }
+        if (mr.combinePlan != null && mr.combinePlan.size() > 0) {
+            StoreFinder cleaner = new StoreFinder(mr.combinePlan);
+            cleaner.visit();
+        }
+        if (mr.reducePlan != null && mr.reducePlan.size() > 0) {
+            StoreFinder cleaner = new StoreFinder(mr.reducePlan);
+            cleaner.visit();
+        }
{code}

Please feel free to disagree with me.
                
> Remove intermediate data after a job finishes
> ---------------------------------------------
>
>                 Key: PIG-3169
>                 URL: https://issues.apache.org/jira/browse/PIG-3169
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Mark Wagner
>            Assignee: Mark Wagner
>            Priority: Minor
>             Fix For: 0.12
>
>         Attachments: PIG-3169.1.patch, PIG-3169.2.patch, PIG-3169.3.patch, 
> PIG-3169-hotfix.patch
>
>
> When using Grunt, intermediate data and distributed caches files are left in 
> 'pig.temp.dir' until the session is closed. It would be nice to cleanup files 
> as they are no longer needed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to