[ 
https://issues.apache.org/jira/browse/PIG-4420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320325#comment-14320325
 ] 

Rohini Palaniswamy edited comment on PIG-4420 at 2/13/15 4:02 PM:
------------------------------------------------------------------

Thanks [[email protected]]. That is a very nice workaround. This feature is 
still nice to have, though: maintaining a list (which avoids the cost of 
constructing a hashmap) and not rearranging tuples to get key,value pairs, as 
replicate join does, would cut a lot of overhead and improve performance 
considerably. Replicate join itself needs a revisit for performance, as the 
SchemaTuple stuff seems to be adding more memory overhead. I recently found 
PIG-865, which is another source of waste for replicate join. The Hive folks 
also said that they have reduced the data structures used in their map side 
join with Tez and that it is far more efficient, but I haven't gotten around 
to looking into it.

The replicate join workaround runs with parallelism equal to the number of 
splits in A. To speed up the CROSS, we also set the settings below to less 
than 128MB, which increases the number of splits in A and therefore the 
parallelism.

mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
pig.maxCombinedSplitSize
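
For illustration, a minimal sketch of how this could look in a Pig script. 
The 64MB value, the input and output paths, the field names, and the 
constant-key form of the replicated join are all assumptions made for the 
example, not the exact script we ran.

```pig
-- Assumed example value: 64MB (67108864 bytes), i.e. something below 128MB.
set mapreduce.input.fileinputformat.split.minsize 67108864;
set mapreduce.input.fileinputformat.split.maxsize 67108864;
set pig.maxCombinedSplitSize 67108864;

-- Hypothetical inputs: A is the large relation, B is the small one.
A = LOAD 'big_input' AS (a1:chararray, a2:chararray);
B = LOAD 'small_input' AS (b1:chararray);

-- Assumed form of the workaround: add the same constant key to both sides
-- so that every record of A matches every record of B in the join.
A1 = FOREACH A GENERATE 1 AS k, a1, a2;
B1 = FOREACH B GENERATE 1 AS k, b1;

-- The last relation listed (B1) is the one loaded into memory on each map task.
J = JOIN A1 BY k, B1 BY k USING 'replicated';

-- Drop the helper key to get the plain cross product.
C = FOREACH J GENERATE A1::a1, A1::a2, B1::b1;
STORE C INTO 'cross_output';
```

Smaller split sizes mean more map tasks over A, each of which loads its own 
copy of B, so the value is a trade-off between parallelism and per-task 
overhead.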


> Support for map side cross similar to replicate join
> ----------------------------------------------------
>
>                 Key: PIG-4420
>                 URL: https://issues.apache.org/jira/browse/PIG-4420
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>
>    Our CROSS implementation is very costly.  We recently had a case where a 
> user was doing a CROSS of 30 million records against 3K records and it 
> caused a lot of disk error exceptions during the shuffle phase. We need to 
> add support for a map side cross syntax:
> C = CROSS A, B using 'replicate';
> The smaller table can be loaded into a list (instead of the hashmap used in 
> replicate join) and iterated through for each record in the bigger table. It 
> should give a major performance boost and drastically reduce resource usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
