Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/JoinOptimization" page has been changed by LiyinTang. http://wiki.apache.org/hadoop/Hive/JoinOptimization?action=diff&rev1=5&rev2=6 -------------------------------------------------- {{attachment:fig2.jpg||height="881px",width="1184px"}} + '''Fig 2. The Optimized Map Join''' + Hive-1641 ([[http://issues.apache.org/jira/browse/HIVE-1293|http://issues.apache.org/jira/browse/HIVE-1641]]) has solved this problem. As shown in Fig2, the basic idea is to create a new task, MapReduce Local Task, before the orginal Join Map/Reduce Task. This new task will read the small table data from HDFS to in-memory hashtable. After reading, it will serialize the in-memory hashtable into files on disk and compress the hashtable file into a tar file. In next stage, when the MapReduce task is launching, it will put this tar file to Hadoop Distributed Cache, which will populate the tar file to each Mapper’s local disk and decompress the file. So all the Mappers can deserialize the hashtable file back into memory and do the join work as before. == 1.2 Removing JDBM == Previously, Hive uses JDBM ([[http://issues.apache.org/jira/browse/HIVE-1293|http://jdbm.sourceforge.net/]]) as a persistent hashtable. Whenever the in-memory hashtable cannot hold data any more, it will swap the key/value into the JDBM table. However when profiing the Map Join, we found out this JDBM component takes more than 70 % CPU time as shown in Fig3. Also the persistent file JDBM genreated is too large to put into the Distributed Cache. For example, if users put 67,000 simple interger key/value pairs into the JDBM, it will generate more 22M hashtable file. So the JDBM is too heavy weight for Map Join and it would better to remove this componet from Hive. Map Join is designed for holding the small table's data into memory. If the table is too large to hold, just run as a Common Join. There is no need to use persistent hashtable any more. + {{attachment:fig3.jpg}} + + '''Fig 3. The Profiling Result of JDBM<<BR>>''' + == 1.3 Performance Evaluation == + '''Table 1: The Comparison between the previous map join with the new optimized map join''' + + {{attachment:fig4.jpg}} + + As shown in Table1, the optmized map join will be 12 ~ 26 times faster than the previous one. Most of map join performance improvement comes from removing the JDBM component. + = 2. Converting Join into Map Join dyanmically = - == 2.1 Join Exeuction Flow == + == 2.1 New Join Exeuction Flow == == 2.2 Resolving the Join Operation at Run Time == == 2.3 Backup Task == == 2.4 Performance Evaluation ==
