On Oct 11, 2010, at 8:18 PM, Dhruba Borthakur wrote:

> I agree with others in this list that Java provides faster software
> development, the IO cost in Java is practically the same as in C/C++, etc.
> In short, most pieces of distributed software can be written in Java without
> any performance hiccups, as long as it is only system metadata that is
> handled by Java.
> 
> One problem is when data-flow has to occur in Java. Each record that is read
> from the storage has to be de-serialized, uncompressed and then processed.
> This processing could be very slow in Java compared to when written in other
> languages, especially because of the creation/destruction of too many
> objects.  It would have been nice if the map/reduce task could have been
> written in C/C++, or better still, if the sorting inside the MR framework
> could occur in C/C++.
> 
> thanks,
> dhruba

There are many places left in Hadoop's design to improve the performance of 
these operations.

The use of InputStream/OutputStream at the record level in the intermediate 
data, for example, is not optimal.  Essentially, the problem is that the 
Writable interface causes per-tuple access to the slow InputStream/OutputStream 
APIs.  As a guideline, never read from, or write to, an InputStream/OutputStream 
in chunks of less than 128 bytes, and optimally go for 512 bytes or more.  Some 
parts of Hadoop have already done this optimization (TextInputFormat) and seen 
gains.  A rough sketch of the idea is below.
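To illustrate (this is not Hadoop's actual serialization path, just a minimal 
sketch): put a buffer between the per-record writes and the raw stream, so each 
tuple is serialized into memory and the underlying stream only ever sees large 
writes.

    import java.io.*;

    public class BufferedWriteSketch {
      public static void main(String[] args) throws IOException {
        // Hypothetical record sink; stands in for a spill/HDFS output stream.
        OutputStream raw = new FileOutputStream("spill.bin");

        // 64 KB buffer: tiny per-tuple writes land in memory, and the
        // underlying stream only sees large, infrequent write() calls.
        DataOutputStream out =
            new DataOutputStream(new BufferedOutputStream(raw, 64 * 1024));

        for (int i = 0; i < 1000000; i++) {
          out.writeInt(i);               // small per-record writes hit the buffer...
          out.writeUTF("value-" + i);
        }
        out.close();                     // ...and reach "raw" in big chunks
      }
    }

The same applies on the read side: wrap the raw stream in a BufferedInputStream 
(or read into your own byte[] and parse from there) rather than issuing a few 
bytes' worth of read() per tuple.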

As for memory consumption of Java itself, tuning the JVM parameters can go a 
long way, especially making sure that -XX:MaxNewSize is set so that, in larger 
heaps, the default one third of the heap is not consumed by the young generation.  
And the most recent JVM has enabled the -XX:+DoEscapeAnalysis flag to elide 
object allocation in several cases.  Both that flag and another memory-saving 
flag, -XX:+UseCompressedOops, will be defaults in the next major HotSpot update.
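For what it's worth, one way to pass those flags to the task JVMs is via the 
mapred.child.java.opts property; the sizes below are examples only and need to 
be adjusted to your heap and JVM version:

    <!-- mapred-site.xml (example values, recent JVM assumed) -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
    </property>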

OpenJDK's Java 7 also has two new sorting routines that improve Java sort 
performance by 20% to 100%.  Hadoop could adopt these algorithms (TimSort and 
dual-pivot quicksort).  I've seen a 10% performance gain in non-Hadoop 
applications when experimenting with the latest OpenJDK, which does register 
allocation and array bounds check elimination better than the current JRE 6.
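To make the sorting point concrete: in Java 7, java.util.Arrays.sort switches 
to dual-pivot quicksort for primitives and TimSort for objects, so code like 
the sketch below speeds up with no source changes.  (Hadoop's map-side sort 
uses its own sorter rather than Arrays.sort, so it would have to implement the 
algorithms itself to see the same benefit.)

    import java.util.Arrays;
    import java.util.Random;

    public class SortDemo {
      public static void main(String[] args) {
        Random rnd = new Random(42);

        // Primitive keys: Java 7's Arrays.sort(int[]) uses dual-pivot quicksort.
        int[] keys = new int[5000000];
        for (int i = 0; i < keys.length; i++) {
          keys[i] = rnd.nextInt();
        }
        long t0 = System.nanoTime();
        Arrays.sort(keys);
        System.out.printf("primitive sort: %d ms%n", (System.nanoTime() - t0) / 1000000);

        // Object keys: Java 7's Arrays.sort(Object[]) uses TimSort, which is
        // especially fast on partially ordered input.
        Integer[] boxed = new Integer[1000000];
        for (int i = 0; i < boxed.length; i++) {
          boxed[i] = rnd.nextInt();
        }
        t0 = System.nanoTime();
        Arrays.sort(boxed);
        System.out.printf("object sort: %d ms%n", (System.nanoTime() - t0) / 1000000);
      }
    }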

In short, there is a lot left to do in Hadoop, IMO, to improve its performance, 
and Java is a great language for safely doing larger-scale refactorings and 
evolving a product.  And the JVM itself is continuing to improve.


> 
> On Mon, Oct 11, 2010 at 4:50 PM, helwr <[email protected]> wrote:
> 
>> 
>> Check out this thread:
>> https://www.quora.com/Why-was-Hadoop-written-in-Java
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Why-hadoop-is-written-in-java-tp1673148p1684291.html
>> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>> 
> 
> 
> 
> -- 
> Connect to me at http://www.facebook.com/dhruba
