[jira] Commented: (PIG-1218) Use distributed cache to store samples

Ashutosh Chauhan (JIRA) Wed, 17 Feb 2010 11:48:51 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834960#action_12834960
 ]


Ashutosh Chauhan commented on PIG-1218:
---------------------------------------

On the trunk patch, in POFRJoin#setUpHashMap():
{code}
POLoad ld = new POLoad(new OperatorKey("Repl File Loader", 1L),
                    replFile, false);
{code}
Should it be the following instead?
{code}
POLoad ld = new POLoad(new OperatorKey("Repl File Loader",
                    NodeIdGenerator.getGenerator().getNextNodeId("Repl File Loader")),
                    replFile, false);
{code}
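
To illustrate the concern behind this question: a hard-coded id of 1L gives every loader built this way the same key, while a per-scope counter does not. The sketch below is a minimal stand-in, not Pig's actual NodeIdGenerator or OperatorKey; the class NodeIdSketch and its key() helper are hypothetical, and only the getNextNodeId(scope) call shape is taken from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

public class NodeIdSketch {
    // Hypothetical per-scope counter; an assumption for illustration,
    // not the real NodeIdGenerator implementation.
    private static final Map<String, Long> NEXT_ID = new HashMap<>();

    /** Returns a fresh id for the given scope on every call. */
    public static synchronized long getNextNodeId(String scope) {
        long id = NEXT_ID.getOrDefault(scope, 0L);
        NEXT_ID.put(scope, id + 1);
        return id;
    }

    /** Builds a "scope-id" string that stands in for an operator key. */
    public static String key(String scope, long id) {
        return scope + "-" + id;
    }

    public static void main(String[] args) {
        String scope = "Repl File Loader";

        // Hard-coded id: both keys are identical.
        String fixedA = key(scope, 1L);
        String fixedB = key(scope, 1L);
        System.out.println("fixed ids collide: " + fixedA.equals(fixedB));

        // Generated ids: each key is distinct.
        String genA = key(scope, getNextNodeId(scope));
        String genB = key(scope, getNextNodeId(scope));
        System.out.println("generated ids collide: " + genA.equals(genB));
    }
}
```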

Also, the following can be moved out of the for loop to avoid calling connect() on pc multiple times.
{code}
PigContext pc = new PigContext(ExecType.MAPREDUCE, props);
pc.connect();
{code}
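
A rough sketch of the difference, assuming the loop iterates over the replicated files. FakeContext is a hypothetical stand-in for PigContext that only counts connect() calls, and the class and method names are made up for illustration; this is not the code from the patch.

```java
import java.util.Arrays;
import java.util.List;

public class ConnectHoistSketch {
    /** Hypothetical stand-in for PigContext; it only counts connect() calls. */
    static class FakeContext {
        static int connectCalls = 0;

        void connect() {
            connectCalls++;
        }
    }

    /** Context created and connected inside the loop: one connect() per file. */
    public static int connectPerFile(List<String> replFiles) {
        FakeContext.connectCalls = 0;
        for (String replFile : replFiles) {
            FakeContext pc = new FakeContext();
            pc.connect();
            // ... load replFile using pc ...
        }
        return FakeContext.connectCalls;
    }

    /** Context hoisted out of the loop: a single connect() for all files. */
    public static int connectOnce(List<String> replFiles) {
        FakeContext.connectCalls = 0;
        FakeContext pc = new FakeContext();
        pc.connect();
        for (String replFile : replFiles) {
            // ... load replFile using the shared pc ...
        }
        return FakeContext.connectCalls;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a", "b", "c");
        System.out.println("connect() calls inside loop: " + connectPerFile(files));
        System.out.println("connect() calls when hoisted: " + connectOnce(files));
    }
}
```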

In JobControlCompiler#setupDistributedCacheForFRJoin():
{code}
new FRJoinDistributedCacheVisitor(mro.reducePlan, pigContext, conf)
                .visit();
{code}
Do we need this? Isn't FR Join a map-side join? If so, a POFRJoin ending up in 
mro.reducePlan would be a bug, no?


> Use distributed cache to store samples
> --------------------------------------
>
>                 Key: PIG-1218
>                 URL: https://issues.apache.org/jira/browse/PIG-1218
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1218.patch, PIG-1218_2.patch
>
>
> Currently, in the case of skew join and order by, we use a sample that is just 
> written to the DFS (not the distributed cache) and, as a result, gets opened and 
> copied around more than necessary. This impacts query performance and also 
> places unnecessary load on the name node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
