Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by AshutoshChauhan: http://wiki.apache.org/pig/PigMergeJoin ------------------------------------------------------------------------------ inputs to a merge join. If users wish to do three plus way joins with this algorithm they can decompose their joins into a series of two ways joins. ---- + We benchmarked the performance of merge-join. Numbers are in table below. + Since joins produce large number of output rows(n-squared in worst case) they are filtered out and not written to disk, so runtimes consist of mostly of CPU time. + Data size is approximately 1.5 GB for 1M rows. + == Performance Benchmark == - ||# of rows || # of rows || Sym-Hash Join || Merge-Join|| + ||# of rows || # of rows || Sym-Hash Join || Merge-Join||Remarks|| ||100K||100K|| 1m30.588s|| 0m41.764s|| ||5M || 5M|| 1m5.579s|| 0m51.535s|| ||15M|| 15M|| 2m55.812s|| 2m16.906s|| @@ -118, +122 @@ ||20M|| 20M|| 10m30.819s|| 5m42.302s|| ||20M|| 20M|| 12m25.878s|| 5m42.302s|| ||20M|| 20M|| 15m56.533s|| 7m57.646s|| - ||20M || 20M|| 112m20.478s||17m16.542s|| + ||20M|| 20M||112m20.478s||17m16.542s||Zipf Key Distribution|| - ||100M || 20M|| 30mins || 15mins || + ||100M|| 20M|| 30mins || 15mins || - ||50M || 100M|| 42mins ||28mins|| + ||50M || 100M|| 42mins || 28mins|| - ||50M || 100M|| 41mins, 56sec|| 28mins|| - ||50M || 100M|| 42mins|| 28mins || + ||50M || 100M|| 42mins || 28mins|| + ||50M || 100M|| 42mins || 28mins || - ||100M ||100M|| 74mins|| 44mins|| + ||100M|| 100M|| 74mins || 44mins|| + In our tests, we found merge-join is faster then symmetric hash join anywhere from 30% to 100% depending on size of data, skew in keys, number of map waves run etc. + One particular thing which affects merge join is how is input data distributed among files. We recommend to have all input files to be approximately equal in size or atleast as large as block size. ---- == Phase 2 ==