Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by ShravanNarayanamurthy: http://wiki.apache.org/pig/PigFRJoin

Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a very small table (small enough to fit in memory) and the join does not expand the data by much. The idea is to distribute the processing of the huge file by fragmenting it and replicating the small file to every machine that receives a fragment of the huge file. Because the entire small file is available on each machine, the join becomes a trivial task with no break in the pipeline.

== Performance ==
The following is a set of parameters that we can alter to compare the performance of the different types of join algorithms:
 1. Query being compared
 1. {{{
A = load 'frag';
...
}}}

== Experiments ==
We compare the times of FRJ implemented as a new operator with Symmetric Hash Join (the normal map-reduce join) and a UDF implementation of FRJ. The changes to the logical side are as per JoinFramework. We differentiate joins into those where the join result is larger than the input (Expanding Join) and those where it is smaller (Reducing Join). Each case is associated with a set of numbers (1.2, 2.1.*.2, etc.) which identify the settings of the above-mentioned parameters for that case.
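The fragment-replicate idea described above can be sketched in plain Java. This is a hypothetical illustration, not Pig's actual operator: the replicated (small) table is loaded into an in-memory hash map keyed on the join column, and each tuple of the huge table's fragment probes the map, so the join completes map-side with no reduce phase. The `String[]` tuple representation and method names are assumptions for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a fragment-replicate join. Tuples are plain String
// arrays here; Pig's real operator works on its own Tuple and DataBag types.
public class FRJoinSketch {

    // Build a hash table over the replicated (small) input, keyed by the
    // join field. This table is built once per machine from the full small file.
    static Map<String, List<String[]>> buildReplicatedMap(List<String[]> small, int keyIdx) {
        Map<String, List<String[]>> map = new HashMap<>();
        for (String[] t : small) {
            map.computeIfAbsent(t[keyIdx], k -> new ArrayList<>()).add(t);
        }
        return map;
    }

    // Stream one fragment of the huge input and probe the map, emitting
    // joined tuples. No shuffle or reduce phase is needed.
    static List<String[]> joinFragment(List<String[]> fragment, int keyIdx,
                                       Map<String, List<String[]>> replicated) {
        List<String[]> out = new ArrayList<>();
        for (String[] big : fragment) {
            List<String[]> matches = replicated.get(big[keyIdx]);
            if (matches == null) continue;      // inner join: drop non-matching tuples
            for (String[] small : matches) {
                String[] joined = new String[big.length + small.length];
                System.arraycopy(big, 0, joined, 0, big.length);
                System.arraycopy(small, 0, joined, big.length, small.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```

Because the map is built once and each fragment tuple is a constant-time probe, the whole join runs in the map pipeline, which is what lets FRJ avoid the break in the pipeline that a reduce-side join requires.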
The following graphs show the performance of the various algorithms:

|| '''Experiment 1: Reducing Join (1.2, 2.1.*.2, 3.1, 4.1, 5.2)''' || '''Experiment 2: Expanding Join (1.2, 2.2.*.1, 3.1, 4.1, 5.2)''' ||
|| attachment:GrpTimes-off.png || attachment:exp-grp-times.png ||
|| '''Experiment 3: Utilization (1.2, 2.1.*.2, 3.1, 4.1, 5.2)''' || '''Experiment 4: Sorted Bag (1.2, 2.1.*.2, 3.1, 4.1, 5.2)''' ||
|| We measure the utilization of the cluster under the various algorithms by running 10 homogeneous jobs simultaneously and calculating the number of jobs that same-sized clusters can run in a minute with each algorithm. The graphs below give the results of the experiments run. || One serious limitation of FRJ is that it tries to read the replicated table into memory: if the file is even slightly bigger, the job dies with an out-of-memory exception. To work around this problem, we can read the replicated tables, and also the fragment of the fragmented table, into Sorted Bags, which are disk-backed structures, and perform a merge join. However, from the graphs below this does not seem like a viable alternative. ||
|| attachment:util.png || attachment:bagjoin.png ||

=== UDF used ===
{{{
...
}
}}}

=== Code ===
FRJ has been submitted as a patch. The jira tracking this issue is https://issues.apache.org/jira/browse/PIG-554.
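The Sorted Bag workaround mentioned in the experiments above amounts to a merge join over two inputs that have already been sorted. The sketch below is a hypothetical simplification: plain sorted integer arrays stand in for Pig's disk-backed sorted bags, and only the matching keys are emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the sorted-bag fallback: when the replicated table
// does not fit in memory, both inputs are spilled into disk-backed sorted
// bags and joined with a single forward merge pass. Sorted int arrays stand
// in for the sorted bags here.
public class SortedBagJoinSketch {

    // Merge-join two inputs already sorted on an integer key; emits one
    // {leftKey, rightKey} pair per matching tuple combination.
    static List<int[]> mergeJoin(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                     // advance the side with the smaller key
            } else if (left[i] > right[j]) {
                j++;
            } else {
                // Equal keys: pair the current left tuple with the whole run
                // of equal keys on the right, then move on. Duplicate left
                // keys re-scan the same right run, giving the cross product.
                int k = j;
                while (k < right.length && right[k] == left[i]) {
                    out.add(new int[]{left[i], right[k]});
                    k++;
                }
                i++;
            }
        }
        return out;
    }
}
```

Each pass reads both inputs sequentially, which is why disk-backed bags make it feasible; the experiments above suggest, however, that the extra sorting and spilling cost makes this variant uncompetitive in practice.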
