Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/3146#issuecomment-62113295
  
    **TL;DR:** The goal is to keep the network package small, with minimal 
dependencies and minimal overhead to verify cross-version compatibility moving 
forward. It is my feeling that protobuf and thrift are expensive dependencies 
to have, and that Java serialization is harder to reason about.
    
    The problem with using thrift or protobuf is inherently about dependencies. 
Protobuf dependencies are already a mess in Spark due to different, 
backwards-incompatible versions being used in Hadoop, Mesos, Akka, etc., and 
adding a real dependency in Spark just complicates the issue. Thrift is another 
relatively common dependency and has a few extra dependencies of its own, but I 
haven't explored that route as far. Since the code here is intended to work 
while running within other JVMs (e.g., YARN Node Manager), we want to keep 
dependencies down.
    
    Other parts of the network package use the "Encodable" interface because 
they write directly to Netty, so this API is natural there (decoding ByteBufs 
straight off an IO buffer, for instance). The choice of Encodable here rather 
than implementing Externalizable/Serializable objects comes down to two 
reasons: simplicity and flexibility. First, the Java serialization framework 
brings a lot of baggage and has some non-obvious pitfalls, and accidental 
misuse may go unnoticed until serialVersionUID mismatch errors arrive. Second, 
it is less obvious how to explicitly handle changes to classes between 
versions. Since we expect the shuffle service to be long-lived, we must be 
able to verify simply and straightforwardly that code will work in a 
cross-version manner, and I feel that is harder to prove when relying on Java 
serialization.
    
    Finally, the thing that makes this problem tractable, in my opinion, is 
that we should never be serializing complex object graphs at this level of the 
API. Everything should ultimately be simple: primitive types, with few or no 
abstract types. We're not trying to solve serialization of general objects, 
just serialization of small, mostly static messages. Arrays of Strings should 
be the most complicated things we have to serialize.
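    Even that ceiling case stays trivial to encode by hand. A sketch, again 
using `java.nio.ByteBuffer` as a stand-in for Netty's `ByteBuf` (the class 
name `StringArrays` is made up for the example):

    ```java
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Length-prefixed encoding for a String[]: a 4-byte element count,
    // then each element as a 4-byte length prefix plus UTF-8 bytes.
    class StringArrays {
        static int encodedLength(String[] arr) {
            int len = 4; // element count prefix
            for (String s : arr) {
                len += 4 + s.getBytes(StandardCharsets.UTF_8).length;
            }
            return len;
        }

        static void encode(ByteBuffer buf, String[] arr) {
            buf.putInt(arr.length);
            for (String s : arr) {
                byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
                buf.putInt(bytes.length);
                buf.put(bytes);
            }
        }

        static String[] decode(ByteBuffer buf) {
            String[] arr = new String[buf.getInt()];
            for (int i = 0; i < arr.length; i++) {
                byte[] bytes = new byte[buf.getInt()];
                buf.get(bytes);
                arr[i] = new String(bytes, StandardCharsets.UTF_8);
            }
            return arr;
        }
    }
    ```

    Roughly thirty lines cover the hardest shape these messages should ever 
need, with no reflection, no class metadata on the wire, and no external 
dependency.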

