Ah awesome.  Passing custom serializers when persisting an RDD is exactly
one of the things I was thinking of.

-Sandy

On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540
> (one of our older JIRAs). I think it would be interesting to explore this
> further. Basically, the way to add it to the API would be to add a version
> of persist() that takes a class other than StorageLevel, say
> StorageStrategy, which would allow specifying a custom serializer or perhaps
> even a transformation to turn each partition into another representation
> before saving it. It would also be interesting if this could work directly
> on an InputStream or ByteBuffer to deal with off-heap data.
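>
> For illustration, a rough sketch of what that might look like (this
> StorageStrategy class and persist() overload are hypothetical, not
> existing Spark API):
>
>   import org.apache.spark.serializer.Serializer
>   import org.apache.spark.storage.StorageLevel
>
>   // Hypothetical: bundles a storage level with an optional custom
>   // serializer and an optional per-partition transformation applied
>   // before the partition is stored.
>   case class StorageStrategy(
>       level: StorageLevel,
>       serializer: Option[Serializer] = None,
>       transform: Option[Iterator[Any] => Iterator[Any]] = None)
>
>   // Hypothetical overload alongside the existing persist(level);
>   // myColumnarSerializer is a stand-in for any custom Serializer.
>   rdd.persist(StorageStrategy(StorageLevel.MEMORY_ONLY_SER,
>     serializer = Some(myColumnarSerializer)))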
>
> One issue we've found with our current Serializer interface by the way is
> that a lot of type information is lost when you pass data to it, so the
> serializers spend a fair bit of time figuring out the class of each object
> being written. With this model, it would be possible for a serializer to know
> that all its data is of one type, which is pretty cool, but we might also
> consider ways of expanding the current Serializer interface to take more
> info.
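>
> For example, a type-aware variant could look something like this
> (TypedSerializer is hypothetical, just to show the idea):
>
>   import java.io.{InputStream, OutputStream}
>
>   // Because T is fixed up front, the serializer never has to record
>   // or look up per-object class information the way the untyped
>   // write path does.
>   trait TypedSerializer[T] {
>     def serialize(value: T, out: OutputStream): Unit
>     def deserialize(in: InputStream): T
>   }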
>
> Matei
>
> > On Nov 7, 2014, at 1:09 AM, Reynold Xin <r...@databricks.com> wrote:
> >
> > Technically you can already use a custom serializer for each shuffle
> > operation (it is part of the ShuffledRDD). I've seen Matei suggest in
> > the past, on JIRA issues (or GitHub), a "storage policy" in which you
> > can specify how data should be stored. I think that would be a great
> > API to have in the long run. Designing it won't be trivial though.
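> >
> > A rough sketch of the per-shuffle case (using KryoSerializer as a
> > stand-in for any custom Serializer; check ShuffledRDD's API in your
> > Spark version):
> >
> >   import org.apache.spark.HashPartitioner
> >   import org.apache.spark.rdd.ShuffledRDD
> >   import org.apache.spark.serializer.KryoSerializer
> >
> >   // pairs: RDD[(String, Int)] built earlier in the job
> >   val shuffled = new ShuffledRDD[String, Int, Int](
> >       pairs, new HashPartitioner(8))
> >     .setSerializer(new KryoSerializer(sc.getConf))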
> >
> >
> > On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza <sandy.r...@cloudera.com>
> > wrote:
> >
> >> Hey all,
> >>
> >> Was messing around with Spark and Google FlatBuffers for fun, and it
> >> got me thinking about Spark and serialization.  I know there's been
> >> work / talk about in-memory columnar formats for Spark SQL, so maybe
> >> there are already ways to provide this flexibility that I've missed?
> >> Either way, my thoughts:
> >>
> >> Java and Kryo serialization are really nice in that they require
> >> almost no extra work on the part of the user.  They can also
> >> represent complex object graphs with cycles, etc.
> >>
> >> There are situations where other serialization frameworks are more
> >> efficient:
> >> * A Hadoop Writable-style format that delineates key-value boundaries
> >> and allows for raw comparisons (sketched after this list) can greatly
> >> speed up some shuffle operations by entirely avoiding deserialization
> >> until the object hits user code.  Writables also probably ser / deser
> >> faster than Kryo.
> >> * "No-deserialization" formats like FlatBuffers and Cap'n Proto
> >> address the tradeoff between (1) Java objects that offer fast access
> >> but take lots of space and stress GC, and (2) Kryo-serialized buffers
> >> that are more compact but take time to deserialize.
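> >>
> >> To illustrate the raw-comparison point above, the interface is
> >> roughly what Hadoop's RawComparator provides (this Scala trait is
> >> just a sketch of the same idea, not an existing Spark API):
> >>
> >>   // Compare two keys directly in their serialized byte form, so a
> >>   // sort-based shuffle never needs to deserialize them.
> >>   trait RawKeyComparator {
> >>     def compare(b1: Array[Byte], s1: Int, l1: Int,
> >>                 b2: Array[Byte], s2: Int, l2: Int): Int
> >>   }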
> >>
> >> The drawbacks of these frameworks are that they require more work
> >> from the user to define types, and that they're more restrictive in
> >> the reference graphs they can represent.
> >>
> >> In large applications, there are probably a few points where a
> >> "specialized" serialization format is useful. But requiring Writables
> >> everywhere because they're needed in a particularly intense shuffle is
> >> cumbersome.
> >>
> >> In light of that, would it make sense to enable varying Serializers
> >> within an app?  It could make sense to choose a serialization
> >> framework based both on the objects being serialized and on what
> >> they're being serialized for (caching vs. shuffle).  It might be
> >> possible to implement this underneath the Serializer interface with
> >> some sort of multiplexing serializer that chooses between
> >> sub-serializers.
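> >>
> >> A toy sketch of the multiplexing idea (hypothetical, not existing
> >> API; a real version would also need to tag each record with the
> >> sub-serializer used, so reads can be routed the same way):
> >>
> >>   import org.apache.spark.serializer.Serializer
> >>
> >>   // Pick a sub-serializer by the runtime class of the object,
> >>   // falling back to a default (e.g. Kryo) for everything else.
> >>   def pickSerializer(obj: Any, subs: Map[Class[_], Serializer],
> >>       fallback: Serializer): Serializer =
> >>     subs.getOrElse(obj.getClass, fallback)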
> >>
> >> Nothing urgent here, but curious to hear others' opinions.
> >>
> >> -Sandy
> >>
>
>
