[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068171#comment-14068171 ]
Binglin Chang commented on MAPREDUCE-2841:
------------------------------------------

Thanks Sean, the patch looks good. I have an issue compiling the code on Mac OS X. The CMake file appears to be mostly copied from hadoop-common (or other subprojects). hadoop-common compiles successfully in my environment, but nativetask fails, so there may be an issue in the CMake file:

{code}
     [copy] Copying 1 file to /Volumes/SSD/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/target/native/test/testData
     [exec] CMake Error at /usr/local/Cellar/cmake/3.0.0/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:136 (message):
     [exec]   Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY
     [exec]   JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
     [exec] Call Stack (most recent call first):
     [exec]   /usr/local/Cellar/cmake/3.0.0/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:343 (_FPHSA_FAILURE_MESSAGE)
     [exec]   /usr/local/Cellar/cmake/3.0.0/share/cmake/Modules/FindJNI.cmake:286 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
     [exec]   JNIFlags.cmake:117 (find_package)
     [exec]   CMakeLists.txt:24 (include)
     [exec] -- Configuring incomplete, errors occurred!
     [exec] See also "/Volumes/SSD/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/target/native/CMakeFiles/CMakeOutput.log".
{code}

> Task level native optimization
> ------------------------------
>
>                 Key: MAPREDUCE-2841
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>         Environment: x86-64 Linux/Unix
>            Reporter: Binglin Chang
>            Assignee: Sean Zhong
>         Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch,
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch,
> fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch
>
>
> I have recently been working on native optimization for MapTask based on JNI.
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs
> emitted by the mapper, so that sort, spill, and IFile serialization can all
> be done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if
> hardware CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask, if
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations of course: currently only Text and BytesWritable are
> supported, and I have not thought through many things yet, such as how to
> support map-side combine. I had some discussion with somebody familiar with
> Hive; it seems that these limitations won't be much of a problem for Hive to
> benefit from those optimizations, at least. Advice or discussion about
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(),
> which checks whether the key/value type, comparator type, and combiner are
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better
> final results, and I believe similar optimizations can be applied to the
> reduce task and shuffle too.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
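
As an illustration of the canEnable() check described in the quoted issue text, here is a minimal Java sketch. It is not the code from the patch: the class name NativeCollectorCheck, the method signature, and the two boolean parameters are hypothetical, and rejecting every custom comparator and every combiner is an assumption based only on the stated limitations (binary string compare only, map-side combine not yet designed).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical compatibility gate; a sketch, not the patch's implementation. */
public class NativeCollectorCheck {

    // The issue text says only Text and BytesWritable are currently supported.
    private static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
            "org.apache.hadoop.io.Text",
            "org.apache.hadoop.io.BytesWritable"));

    /**
     * Returns true only if the job looks compatible with a native collector.
     *
     * @param keyClass         fully qualified map output key class name
     * @param valueClass       fully qualified map output value class name
     * @param customComparator true if the job sets a non-default key comparator
     *                         (assumed incompatible: only binary compare is supported)
     * @param hasCombiner      true if a map-side combiner is configured
     *                         (assumed incompatible: combine is not supported yet)
     */
    public static boolean canEnable(String keyClass, String valueClass,
                                    boolean customComparator, boolean hasCombiner) {
        if (!SUPPORTED_TYPES.contains(keyClass)
                || !SUPPORTED_TYPES.contains(valueClass)) {
            return false; // unsupported key or value type
        }
        if (customComparator) {
            return false; // native sort cannot honor a custom ordering
        }
        if (hasCombiner) {
            return false; // fall back to the Java collector
        }
        return true;
    }

    public static void main(String[] args) {
        // Supported types, default comparator, no combiner.
        System.out.println(canEnable("org.apache.hadoop.io.Text",
                "org.apache.hadoop.io.BytesWritable", false, false));
        // IntWritable is not in the supported set.
        System.out.println(canEnable("org.apache.hadoop.io.Text",
                "org.apache.hadoop.io.IntWritable", false, false));
    }
}
```

With a gate of this shape, the caller would simply fall back to the existing Java collector whenever the check returns false, so unsupported jobs keep working unchanged.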