Ankur Goel wrote:
How fast do we expect the new serialization system to be when it replaces the existing serialization mechanism in Hadoop RPC?
I hope that Avro will make its first release this summer. Sometime soon after, I hope we can start moving Hadoop Core's trunk RPC onto Avro. We may start developing an experimental version of Hadoop Core that uses Avro in a branch before Avro is released. This is all speculative, of course. Any detailed discussion of Hadoop Core's future belongs on core-dev@, and of Avro's future on avro-...@.
A clear description of the existing bottlenecks and the performance goals for this system would help developers interested in contributing.
Adding Avro to Hadoop Core is not primarily about performance but rather about compatibility and security.
Hadoop's existing RPC is not a performance bottleneck, nor is HDFS's data transfer protocol. However, Hadoop currently requires that clients and servers run the exact same version of the code, since the existing RPC is not tolerant of protocol changes. We'd like to change that, so that one can run older clients against newer servers and vice versa. Longer term, we'd also like to permit clients in languages other than Java. We intend Avro to provide a change-tolerant, cross-platform RPC solution.
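To make the compatibility point concrete, here is a minimal sketch of Avro's schema resolution using the current Avro Java API (the "Request" record and its fields are hypothetical, purely for illustration): an older client reads data written by a newer server, silently skipping the field it doesn't know about.

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.io.Decoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.Encoder;
  import org.apache.avro.io.EncoderFactory;

  public class EvolutionSketch {
    // Newer (writer's) schema: adds a "timeout" field with a default.
    static final String NEW_SCHEMA =
        "{\"type\":\"record\",\"name\":\"Request\",\"fields\":["
      + "{\"name\":\"path\",\"type\":\"string\"},"
      + "{\"name\":\"timeout\",\"type\":\"long\",\"default\":0}]}";
    // Older (reader's) schema: knows nothing about "timeout".
    static final String OLD_SCHEMA =
        "{\"type\":\"record\",\"name\":\"Request\",\"fields\":["
      + "{\"name\":\"path\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
      Schema writer = new Schema.Parser().parse(NEW_SCHEMA);
      Schema reader = new Schema.Parser().parse(OLD_SCHEMA);

      // A newer server serializes a record with both fields.
      GenericData.Record rec = new GenericData.Record(writer);
      rec.put("path", "/foo");
      rec.put("timeout", 30000L);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      Encoder enc = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericData.Record>(writer).write(rec, enc);
      enc.flush();

      // An older client deserializes it: schema resolution skips the
      // unknown "timeout" field instead of failing on the mismatch.
      Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
      Object result =
          new GenericDatumReader<Object>(writer, reader).read(null, dec);
      System.out.println(result);  // prints {"path": "/foo"}
    }
  }

The "default" on the added field also covers the reverse direction: a newer reader can fill it in when decoding data from an older writer, so both upgrade orders work.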
We'd also like Hadoop to become more secure. Currently Hadoop uses three different communication mechanisms: RPC, HTTP (for the shuffle), and a raw socket-based protocol for HDFS data transfers. It would be best not to have to re-implement security features for each of these. So we hope that we can make Avro perform well enough to replace not only Hadoop's RPC, but also HTTP in the shuffle and the HDFS data transfer protocol.
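As a sketch of what a single wire contract could look like, an Avro protocol is declared in JSON and parsed at runtime; the service and message names below are hypothetical, not anything proposed in this thread.

  import org.apache.avro.Protocol;

  public class ProtocolSketch {
    // Hypothetical protocol: one Avro declaration could describe a
    // service such as HDFS block reads, rather than a hand-rolled
    // socket protocol with its own framing and security code.
    static final String PROTO =
        "{\"protocol\":\"BlockService\",\"namespace\":\"example\","
      + "\"messages\":{\"readBlock\":{"
      + "\"request\":[{\"name\":\"blockId\",\"type\":\"long\"}],"
      + "\"response\":\"bytes\"}}}";

    public static void main(String[] args) {
      Protocol p = Protocol.parse(PROTO);
      System.out.println(p.getMessages().keySet());  // prints [readBlock]
    }
  }

Because the declaration is just data, the same description could drive Java servers and non-Java clients alike.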
If you're interested in discussing Avro further, I encourage you to join the Avro mailing lists.
http://hadoop.apache.org/avro/mailing_lists.html

Doug
