[ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789294#action_12789294
 ] 

Ankit Modi commented on PIG-1106:
---------------------------------

Tests I ran were using two files

file format
f1: random chararray(100)
f2: random int

leftside file contained 100 tuples and right side file contain 3million tuples.

Code
{noformat}
A = load 'leftsidefrjoin.txt' as ( key, value);
B = load 'rightsidefrjoin.txt' as (key, value);
C = join A by key left, B by key using "repl";
--- Fragmented input and replicated input
store C into 'output';
{noformat}

This generated following error
{noformat}
FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : 
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.ArrayList.<init>(ArrayList.java:112)
        at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
        at 
org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:369)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:288)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.setUpHashMap(POFRJoin.java:351)
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:211)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:250)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:241)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
{noformat}

I ran the same job with same records on left hand side and 100K records on 
right hand side. The job completed successfully.

> FR join should not spill
> ------------------------
>
>                 Key: PIG-1106
>                 URL: https://issues.apache.org/jira/browse/PIG-1106
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Ankit Modi
>             Fix For: 0.7.0
>
>         Attachments: frjoin-nonspill.patch
>
>
> Currently, the values for the replicated side of the data are placed in a 
> spillable bag (POFRJoin near line 275). This does not make sense because the 
> whole point of the optimization is that the data on one side fits into 
> memory. We already have a non-spillable bag implemented 
> (NonSpillableDataBag.java) and we need to change FRJoin code to use it. And 
> of course need to do lots of testing to make sure that we don't spill but die 
> instead when we run out of memory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to