Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/3146#issuecomment-62113295

**TL;DR:** The goal is to keep the network package small, with minimal dependencies and minimal overhead to verify cross-version compatibility moving forward. My feeling is that protobuf and thrift are expensive dependencies to take on, and that Java serialization is harder to reason about.

The problem with using thrift or protobuf is fundamentally one of dependencies. Protobuf dependencies are already a mess in Spark because different, backwards-incompatible versions are used by Hadoop, Mesos, Akka, etc., and adding a real protobuf dependency in Spark itself would only complicate the issue. Thrift is another relatively common dependency and brings a few transitive dependencies of its own, though I haven't explored that route as far. Since the code here is intended to run inside other JVMs (e.g., the YARN NodeManager), we want to keep dependencies to a minimum.

Other parts of the network package use the "Encodable" interface because they write directly to Netty, so this API is natural there (decoding ByteBufs straight from an IO buffer, for instance). The choice of Encodable here, rather than implementing Externalizable/Serializable objects, comes down to two things: simplicity and flexibility. First, the Java serialization framework brings a lot of baggage and has some non-obvious pitfalls, and accidental misuse may go unnoticed until serialVersionUID mismatch errors arrive. Second, it is less obvious how to explicitly handle changes to classes between versions. Since we expect the shuffle service to be long-lived, we must be able to verify simply and straightforwardly that the code works in a cross-version manner, and I feel that is harder to prove when relying on Java serialization.

Finally, the thing that makes this problem tractable, in my opinion, is that we should never be serializing complex object graphs at this level of the API.
Everything should ultimately be simple: primitive types, with minimal to no abstract types. We're not trying to solve serialization of general objects, only serialization of small, mostly static messages. Arrays of Strings should be the most complicated things we have to serialize.
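To make the pattern concrete, here is a minimal sketch of the Encodable approach described above. The real interface in Spark's network package writes to Netty ByteBufs; this sketch substitutes `java.nio.ByteBuffer` to stay dependency-free, and the `OpenBlocks` message shown (an appId plus an array of block ID Strings) is illustrative, not the actual class. The point it demonstrates is that the wire format is spelled out field by field in the code, so cross-version compatibility can be audited by reading `encode`/`decode` side by side.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodableSketch {

    /** Hypothetical stand-in for the network package's Encodable interface
     *  (the real one encodes into Netty ByteBufs). */
    interface Encodable {
        int encodedLength();          // exact byte size of the encoded form
        void encode(ByteBuffer buf);  // write fields in a fixed, documented order
    }

    /** Illustrative message: a String plus a String array -- about as
     *  complex as messages at this layer should ever get. */
    static class OpenBlocks implements Encodable {
        final String appId;
        final String[] blockIds;

        OpenBlocks(String appId, String[] blockIds) {
            this.appId = appId;
            this.blockIds = blockIds;
        }

        private static int stringLength(String s) {
            return 4 + s.getBytes(StandardCharsets.UTF_8).length;
        }

        private static void putString(ByteBuffer buf, String s) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            buf.putInt(bytes.length);  // length-prefixed UTF-8
            buf.put(bytes);
        }

        private static String getString(ByteBuffer buf) {
            byte[] bytes = new byte[buf.getInt()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        @Override
        public int encodedLength() {
            int len = stringLength(appId) + 4;  // appId + array count
            for (String id : blockIds) len += stringLength(id);
            return len;
        }

        @Override
        public void encode(ByteBuffer buf) {
            putString(buf, appId);
            buf.putInt(blockIds.length);
            for (String id : blockIds) putString(buf, id);
        }

        /** Decoding mirrors encode() field by field, so any format change
         *  between versions is explicit and visible in review. */
        static OpenBlocks decode(ByteBuffer buf) {
            String appId = getString(buf);
            String[] ids = new String[buf.getInt()];
            for (int i = 0; i < ids.length; i++) ids[i] = getString(buf);
            return new OpenBlocks(appId, ids);
        }
    }

    public static void main(String[] args) {
        OpenBlocks msg = new OpenBlocks("app-2014", new String[]{"shuffle_0_1", "shuffle_0_2"});
        ByteBuffer buf = ByteBuffer.allocate(msg.encodedLength());
        msg.encode(buf);
        buf.flip();
        OpenBlocks decoded = OpenBlocks.decode(buf);
        System.out.println(decoded.appId + " " + Arrays.toString(decoded.blockIds));
    }
}
```

Note that `encodedLength()` returns the exact buffer size up front, which matches the Netty-oriented usage: the caller can allocate precisely, and a mismatch between declared length and bytes written surfaces immediately rather than as a silent versioning bug.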