On Oct 11, 2010, at 8:18 PM, Dhruba Borthakur wrote:

> I agree with others in this list that Java provides faster software
> development, the IO cost in Java is practically the same as in C/C++, etc.
> In short, most pieces of distributed software can be written in Java without
> any performance hiccups, as long as it is only system metadata that is
> handled by Java.
>
> One problem is when data-flow has to occur in Java. Each record that is read
> from the storage has to be de-serialized, uncompressed and then processed.
> This processing could be very slow in Java compared to when written in other
> languages, especially because of the creation/destruction of too many
> objects. It would have been nice if the map/reduce task could have been
> written in C/C++, or better still, if the sorting inside the MR framework
> could occur in C/C++.
>
> thanks,
> dhruba
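
Dhruba's point about per-record object creation/destruction is worth making concrete. The usual Writable idiom already avoids part of it, since readFields() refills an existing instance rather than allocating a fresh object per tuple. Below is a rough, standalone sketch of the two styles; it is plain Java with a made-up Record class, not actual Hadoop code.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical 16-byte record, loosely mirroring the Writable pattern:
// readFields() fills in an existing instance instead of returning a new one.
class Record {
    long key;
    long value;

    void readFields(DataInputStream in) throws IOException {
        key = in.readLong();
        value = in.readLong();
    }
}

public class ObjectChurnSketch {
    public static void main(String[] args) throws IOException {
        byte[] raw = new byte[16 * 1000000]; // stand-in for serialized input

        // Style 1: a fresh object per tuple -- this is the churn that the
        // JIT's escape analysis may or may not be able to optimize away.
        DataInputStream in1 = new DataInputStream(new ByteArrayInputStream(raw));
        long sum1 = 0;
        for (int i = 0; i < 1000000; i++) {
            Record r = new Record();   // one short-lived allocation per record
            r.readFields(in1);
            sum1 += r.value;
        }

        // Style 2: one mutable instance refilled for every tuple, the way
        // Writable.readFields() is normally used.
        DataInputStream in2 = new DataInputStream(new ByteArrayInputStream(raw));
        Record reused = new Record();
        long sum2 = 0;
        for (int i = 0; i < 1000000; i++) {
            reused.readFields(in2);    // no per-record allocation
            sum2 += reused.value;
        }

        System.out.println(sum1 + " == " + sum2);
    }
}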
There are many places left in Hadoop's design to improve the performance of these actions. The use of InputStream/OutputStream is not optimal at the record level in the intermediate data, for example. Essentially, the problem is that the Writable interface causes per-tuple access to the slow InputStream/OutputStream APIs. As a guideline, never read from, or write to, an InputStream/OutputStream in chunks of less than 128 bytes, and ideally aim for 512 bytes or more. Some parts of Hadoop have already done this optimization (TextInputFormat) and seen gains. (There is a small buffering sketch at the end of this mail.)

As for the memory consumption of Java itself, tuning the JVM parameters can go a long way, especially making sure that -XX:MaxNewSize is set so that in larger heaps the young generation does not consume the default 1/3 of the heap. The most recent JVM also enables the -XX:+DoEscapeAnalysis flag to elide object allocation in several cases. Both that flag and another memory-saving flag, -XX:+UseCompressedOops, will be on by default in the next major HotSpot update.

OpenJDK's Java 7 has two new sorting routines that improve Java sort performance by 20% to 100%, too. Hadoop could implement these algorithms (TimSort and Dual-Pivot Quicksort). I've also seen a 10% performance gain in non-Hadoop applications when experimenting with the latest OpenJDK, which does register allocation and array bounds check elimination better than the current JRE 6. (A sort sketch is at the end of this mail as well.)

In short, there is a lot left to do in Hadoop, IMO, to improve its performance, and Java is a great language for being able to safely do larger-scale refactorings and evolve a product. And the JVM itself is continuing to improve.

>
> On Mon, Oct 11, 2010 at 4:50 PM, helwr <[email protected]> wrote:
>
>> Check out this thread:
>> https://www.quora.com/Why-was-Hadoop-written-in-Java
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Why-hadoop-is-written-in-java-tp1673148p1684291.html
>> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>>
>
> --
> Connect to me at http://www.facebook.com/dhruba
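
Two quick sketches to make the points above concrete. Both are plain, standalone Java rather than actual Hadoop code, and the class and file names are made up for illustration.

First, the buffering guideline. The idea is simply that per-record writes of a few bytes should land in an in-memory buffer, so the underlying stream only ever sees large chunks:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferedRecordWriter {
    public static void main(String[] args) throws IOException {
        // Each record is only 12 bytes, far below the 128-byte guideline, but
        // the 64 KB BufferedOutputStream means those writes hit an in-memory
        // byte[] and the FileOutputStream underneath only sees large chunks.
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("records.bin"), 64 * 1024));
        try {
            for (int i = 0; i < 1000000; i++) {
                out.writeInt(i);        // 4-byte key
                out.writeLong(i * 37L); // 8-byte value
            }
        } finally {
            out.close(); // flushes the remaining buffered bytes once
        }
    }
}

Second, the sorting point. Code that sorts plain Java arrays picks up the new algorithms automatically once it runs on JDK 7 (Arrays.sort on primitives switches to Dual-Pivot Quicksort, Arrays.sort on objects to TimSort), but as far as I know Hadoop's map-side sort runs over its own serialized buffers with its own sorter classes, so it would have to port the algorithms to benefit:

import java.util.Arrays;

public class SortSketch {
    public static void main(String[] args) {
        // On JDK 7, sorting a primitive array uses Dual-Pivot Quicksort.
        int[] keys = {42, 7, 19, 3, 88, 1};
        Arrays.sort(keys);
        System.out.println(Arrays.toString(keys));

        // On JDK 7, sorting an object array uses TimSort, which does very
        // well on partially ordered input (data that arrives in runs).
        String[] words = {"delta", "alpha", "charlie", "bravo"};
        Arrays.sort(words);
        System.out.println(Arrays.toString(words));
    }
}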
