Hey all,

Was messing around with Spark and Google FlatBuffers for fun, and it got me
thinking about Spark and serialization.  I know there's been work / talk
about in-memory columnar formats in Spark SQL, so maybe there are ways to
provide this flexibility already that I've missed?  Either way, my thoughts:

Java and Kryo serialization are really nice in that they require almost no
extra work on the part of the user.  They can also represent complex object
graphs with cycles etc.
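As a quick illustration of that point, here's a minimal sketch (plain Java serialization, no Spark involved) showing a cyclic object graph surviving a serialization round trip with zero schema work from the user — something the no-deserialization formats below generally can't do:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// A node in a cyclic graph; Java serialization handles the
// back-reference automatically, with no user-defined schema.
class Node(val name: String) extends Serializable {
  var next: Node = _
}

object CycleDemo {
  def roundTripPreservesCycle(): Boolean = {
    val a = new Node("a")
    val b = new Node("b")
    a.next = b
    b.next = a // cycle: a -> b -> a

    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(a)
    oos.close()

    val in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    val a2 = in.readObject().asInstanceOf[Node]
    // The cycle survives the round trip: a2 -> b2 -> a2
    a2.next.next eq a2
  }

  def main(args: Array[String]): Unit =
    println("cycle preserved: " + roundTripPreservesCycle())
}
```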

There are situations where other serialization frameworks are more
efficient:
* A Hadoop Writable style format that delineates key-value boundaries and
allows for raw comparisons can greatly speed up some shuffle operations by
entirely avoiding deserialization until the object hits user code.
Writables also probably ser / deser faster than Kryo.
* "No-deserialization" formats like FlatBuffers and Cap'n Proto address the
tradeoff between (1) Java objects that offer fast access but take lots of
space and stress GC and (2) Kryo-serialized buffers that are more compact
but take time to deserialize.
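To make the first bullet concrete, here's a toy sketch of raw-comparison sorting in the spirit of Hadoop's WritableComparator.compareBytes: keys are big-endian-encoded and then ordered purely by unsigned byte comparison, never deserialized during the sort. (The encoding and helper names here are mine, for illustration; for big-endian non-negative ints, byte order happens to match numeric order.)

```scala
import java.nio.ByteBuffer

object RawCompareDemo {
  // Lexicographic unsigned-byte comparison, like Hadoop's
  // WritableComparator.compareBytes: no deserialization needed.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff)
      if (cmp != 0) return cmp
      i += 1
    }
    a.length - b.length
  }

  // Big-endian encoding of a non-negative Int: for these values,
  // raw byte order agrees with numeric order.
  def encode(x: Int): Array[Byte] = ByteBuffer.allocate(4).putInt(x).array()

  // Sort entirely on the serialized bytes; decode only at the end to
  // check the result.
  def sortRaw(xs: Seq[Int]): Seq[Int] =
    xs.map(encode)
      .sortWith((a, b) => compareBytes(a, b) < 0)
      .map(ByteBuffer.wrap(_).getInt)

  def main(args: Array[String]): Unit =
    println(sortRaw(Seq(42, 7, 1000, 0)).mkString(",")) // 0,7,42,1000
}
```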

The drawbacks of these frameworks are that they require more work from the
user to define types, and that they're more restrictive in the reference
graphs they can represent.

In large applications, there are probably a few points where a
"specialized" serialization format is useful. But requiring Writables
everywhere because they're needed in a particularly intense shuffle is
cumbersome.

In light of that, would it make sense to enable varying Serializers within
an app? It could make sense to choose a serialization framework both based
on the objects being serialized and what they're being serialized for
(caching vs. shuffle).  It might be possible to implement this underneath
the Serializer interface with some sort of multiplexing serializer that
chooses between subserializers.
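A rough sketch of what that multiplexing could look like. To keep it self-contained this uses a made-up SimpleSerializer trait rather than the real org.apache.spark.serializer.Serializer API (which works over streams and SerializerInstances), but the dispatch idea is the same: route each record to a sub-serializer by predicate, and tag the output with one byte so the reader knows which sub-serializer to use on the way back.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Simplified stand-in for Spark's Serializer interface (illustration only).
trait SimpleSerializer {
  def serialize(obj: Any): Array[Byte]
  def deserialize(bytes: Array[Byte]): Any
}

// Fallback: plain Java serialization.
object JavaSimpleSerializer extends SimpleSerializer {
  def serialize(obj: Any): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj.asInstanceOf[AnyRef])
    oos.close()
    bos.toByteArray
  }
  def deserialize(bytes: Array[Byte]): Any =
    new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject()
}

// A cheap specialized serializer for one type.
object Utf8Serializer extends SimpleSerializer {
  def serialize(obj: Any): Array[Byte] = obj.asInstanceOf[String].getBytes("UTF-8")
  def deserialize(bytes: Array[Byte]): Any = new String(bytes, "UTF-8")
}

// Hypothetical multiplexer: picks a sub-serializer per record and
// prefixes each payload with the index of the serializer that wrote it.
class MultiplexingSerializer(
    routes: Seq[(Any => Boolean, SimpleSerializer)],
    default: SimpleSerializer) extends SimpleSerializer {

  def serialize(obj: Any): Array[Byte] = {
    val idx = routes.indexWhere(_._1(obj)) // -1 means default
    val s = if (idx < 0) default else routes(idx)._2
    val body = s.serialize(obj)
    val out = new Array[Byte](body.length + 1)
    out(0) = idx.toByte
    System.arraycopy(body, 0, out, 1, body.length)
    out
  }

  def deserialize(bytes: Array[Byte]): Any = {
    val s = if (bytes(0) < 0) default else routes(bytes(0).toInt)._2
    s.deserialize(bytes.drop(1))
  }
}
```

Usage would look something like:

```scala
val ser = new MultiplexingSerializer(
  Seq(((o: Any) => o.isInstanceOf[String], Utf8Serializer)),
  JavaSimpleSerializer)

ser.deserialize(ser.serialize("hello")) // routed to Utf8Serializer
ser.deserialize(ser.serialize(42))      // falls back to Java serialization
```

In a real implementation the routing decision would probably also want to see the serialization context (cache vs. shuffle), not just the object.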

Nothing urgent here, but curious to hear others' opinions.

-Sandy
