[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789294#action_12789294 ]
Ankit Modi commented on PIG-1106:
---------------------------------

The tests I ran used two files with the format f1: random chararray(100), f2: random int. The left-side file contained 100 tuples and the right-side file contained 3 million tuples.

Code:
{noformat}
A = load 'leftsidefrjoin.txt' as (key, value);
B = load 'rightsidefrjoin.txt' as (key, value);
C = join A by key left, B by key using 'repl'; -- fragmented input and replicated input
store C into 'output';
{noformat}

This generated the following error:
{noformat}
FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.ArrayList.<init>(ArrayList.java:112)
	at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
	at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:369)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:288)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.setUpHashMap(POFRJoin.java:351)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:211)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:250)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:241)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
{noformat}

I ran the same job with the same records on the left-hand side and 100K records on the right-hand side. That job completed successfully.

> FR join should not spill
> ------------------------
>
>                 Key: PIG-1106
>                 URL: https://issues.apache.org/jira/browse/PIG-1106
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Ankit Modi
>             Fix For: 0.7.0
>
>         Attachments: frjoin-nonspill.patch
>
>
> Currently, the values for the replicated side of the data are placed in a
> spillable bag (POFRJoin, near line 275). This does not make sense because the
> whole point of the optimization is that the data on one side fits into
> memory. We already have a non-spillable bag implemented
> (NonSpillableDataBag.java), and we need to change the FRJoin code to use it.
> And of course we need to do lots of testing to make sure that we don't spill
> but die instead when we run out of memory.
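To make the failure mode above easier to follow, here is a minimal sketch of the fragment-replicate join idea in Python (not Pig's actual implementation; the function name `fr_left_join` is invented for illustration): the replicated side is materialized in memory as a hash table, so the whole right-hand input must fit in the heap, and spilling it would defeat the purpose of the optimization.

```python
from collections import defaultdict

def fr_left_join(fragmented, replicated):
    """LEFT outer fragment-replicate join sketch.

    `fragmented` is streamed; `replicated` is held entirely in memory,
    analogous to the bag POFRJoin builds per join key. If `replicated`
    does not fit in memory, the job should fail fast rather than spill.
    """
    # Build phase: hash the replicated (right) input on the join key.
    table = defaultdict(list)
    for key, value in replicated:
        table[key].append(value)

    # Probe phase: stream the fragmented (left) input; because this is a
    # LEFT outer join, unmatched left tuples are emitted with a null.
    out = []
    for key, value in fragmented:
        matches = table.get(key)
        if matches:
            for rv in matches:
                out.append((key, value, rv))
        else:
            out.append((key, value, None))
    return out
```

With 3 million tuples on the replicated side, the build phase alone allocates one in-memory entry per tuple, which matches the OOM seen in `POFRJoin.setUpHashMap` in the stack trace above.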