Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/JoinOptimization" page has been changed by LiyinTang. http://wiki.apache.org/hadoop/Hive/JoinOptimization?action=diff&rev1=2&rev2=3
--------------------------------------------------

= 1. Map Join Optimization =

== 1.1 Using Distributed Cache to Propagate Hashtable File ==

Previously, when two large tables needed to be joined, two different Mappers would sort these tables on the join key and emit intermediate files, and the Reducer would take the intermediate files as input and do the real join work. This join mechanism works well when both tables are large. But if one of the join tables is small enough to fit into the Mapper's memory, there is no need to launch a Reducer at all. In fact, the Reduce stage is very expensive for performance, because the Map/Reduce framework needs to sort and merge the intermediate files.

So the basic idea of map join is to hold the data of the small table in the Mapper's memory and do the join work in the Map stage, which saves the Reduce stage. As shown in Fig 1, the previous map join operation does not scale for large data, because each Mapper reads the small table directly from HDFS. If the large data file is big enough, thousands of Mappers will be launched to read different records of it. These thousands of Mappers will all read the small table from HDFS into memory, which makes the small table the performance bottleneck; sometimes Mappers time out while reading this small file, which may cause the task to fail.
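The map-join mechanism described above can be sketched as follows. This is a simplified, single-process sketch, not Hive's actual implementation: the table data and function name are hypothetical, and a real Mapper would stream the large table's rows from HDFS rather than hold them in a list.

```python
# Sketch of a map-side (hash) join: the small table is loaded into an
# in-memory hashtable, and rows of the large table probe it directly,
# so no sort/shuffle and no Reduce stage are needed.

def map_join(small_table, large_table, small_key=0, large_key=0):
    """Join two tables of tuples by holding the small one in memory."""
    # Build a hashtable from the small table, keyed on the join column.
    hashtable = {}
    for row in small_table:
        hashtable.setdefault(row[small_key], []).append(row)

    # Stream the large table; each row is joined against the hashtable.
    for row in large_table:
        for match in hashtable.get(row[large_key], []):
            yield row + match

small = [(1, "a"), (2, "b")]
large = [(1, "x"), (1, "y"), (3, "z")]
print(list(map_join(small, large)))
# → [(1, 'x', 1, 'a'), (1, 'y', 1, 'a')]
```

The sketch also shows why the small table's size matters: the whole hashtable must fit in each Mapper's memory, which is exactly the constraint the distributed-cache optimization addresses.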
