[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

Hadoop QA (JIRA) Mon, 16 Nov 2009 21:32:07 -0800

    [ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778732#action_12778732 ]


Hadoop QA commented on PIG-872:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425174/PIG_872.patch
  against trunk revision 881008.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/console

This message is automatically generated.

> use distributed cache for the replicated data set in FR join
> ------------------------------------------------------------
>
>                 Key: PIG-872
>                 URL: https://issues.apache.org/jira/browse/PIG-872
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>         Attachments: PIG_872.patch
>
>
> Currently, the replicated file is read directly from DFS by all maps. If the 
> number of concurrent maps is huge, we can overwhelm the NameNode with 
> open calls.
> Using the distributed cache will address the issue and might also give a 
> performance boost, since the file will be copied locally once and then reused 
> by all tasks running on the same machine.
> The basic approach would be to use cacheArchive to place the file into the 
> cache on the frontend; on the backend, the tasks would need to refer to 
> the data using the path from the cache.
> Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
> us right now as we don't use it.)
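
The load argument in the description above can be illustrated with a small, self-contained sketch. It is not code from the patch and does not call any Hadoop API: the class name ReplicatedReadCost, both method names, and the task-to-node assignment are hypothetical, chosen only to show that direct reads scale with the number of map tasks while cached copies scale with the number of machines.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of PIG_872.patch.
public class ReplicatedReadCost {

    // Direct DFS read: every map task opens the replicated file itself,
    // so the number of open calls equals the number of tasks.
    static int directOpens(int[] taskToNode) {
        return taskToNode.length;
    }

    // Distributed cache: the file is copied once per machine and then
    // reused by all tasks on that machine, so the number of fetches
    // equals the number of distinct nodes that run at least one task.
    static int cachedFetches(int[] taskToNode) {
        Set<Integer> nodes = new HashSet<>();
        for (int node : taskToNode) {
            nodes.add(node);
        }
        return nodes.size();
    }

    public static void main(String[] args) {
        // Assumed workload: 1000 map tasks spread round-robin over 50 nodes.
        int[] taskToNode = new int[1000];
        for (int t = 0; t < taskToNode.length; t++) {
            taskToNode[t] = t % 50;
        }
        System.out.println("direct opens:   " + directOpens(taskToNode));
        System.out.println("cached fetches: " + cachedFetches(taskToNode));
    }
}
```

Under these assumed numbers the replicated file is opened 1000 times when read directly and fetched 50 times when served from the cache. The real saving depends on how the scheduler places tasks.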

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
