Technically you can already do custom serializer for each shuffle operation
(it is part of the ShuffledRDD). I've seen Matei suggesting on jira issues
(or github) in the past a "storage policy" in which you can specify how
data should be stored. I think that would be a great API to have in the
long run. Designing it won't be trivial though.


On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hey all,
>
> Was messing around with Spark and Google FlatBuffers for fun, and it got me
> thinking about Spark and serialization.  I know there's been work / talk
> about in-memory columnar formats Spark SQL, so maybe there are ways to
> provide this flexibility already that I've missed?  Either way, my
> thoughts:
>
> Java and Kryo serialization are really nice in that they require almost no
> extra work on the part of the user.  They can also represent complex object
> graphs with cycles etc.
>
> There are situations where other serialization frameworks are more
> efficient:
> * A Hadoop Writable style format that delineates key-value boundaries and
> allows for raw comparisons can greatly speed up some shuffle operations by
> entirely avoiding deserialization until the object hits user code.
> Writables also probably ser / deser faster than Kryo.
> * "No-deserialization" formats like FlatBuffers and Cap'n Proto address the
> tradeoff between (1) Java objects that offer fast access but take lots of
> space and stress GC and (2) Kryo-serialized buffers that are more compact
> but take time to deserialize.
>
> The drawbacks of these frameworks are that they require more work from the
> user to define types.  And that they're more restrictive in the reference
> graphs they can represent.
>
> In large applications, there are probably a few points where a
> "specialized" serialization format is useful. But requiring Writables
> everywhere because they're needed in a particularly intense shuffle is
> cumbersome.
>
> In light of that, would it make sense to enable varying Serializers within
> an app? It could make sense to choose a serialization framework both based
> on the objects being serialized and what they're being serialized for
> (caching vs. shuffle).  It might be possible to implement this underneath
> the Serializer interface with some sort of multiplexing serializer that
> chooses between subserializers.
>
> Nothing urgent here, but curious to hear other's opinions.
>
> -Sandy
>

Reply via email to