Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540 (one 
of our older JIRAs). I think it would be interesting to explore this further. 
Basically, the way to add it into the API would be to add a version of persist() 
that takes a class other than StorageLevel, say StorageStrategy, which would allow 
specifying a custom serializer, or perhaps even a transformation to turn each 
partition into another representation before saving it. It would also be 
interesting if this could work directly on an InputStream or ByteBuffer to deal 
with off-heap data.
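
To make that concrete, here's a very rough sketch of what such an API could 
look like (everything below -- StorageStrategy itself, the transform hook, the 
persist() overload -- is hypothetical, just to show the shape of it):

    import org.apache.spark.serializer.Serializer
    import org.apache.spark.storage.StorageLevel

    // Hypothetical: bundle a StorageLevel with an optional custom serializer
    // and an optional per-partition transformation applied before storing.
    class StorageStrategy(
        val level: StorageLevel,
        val serializer: Option[Serializer] = None,
        val transform: Option[Iterator[Any] => Iterator[Any]] = None)

    // The new overload on RDD would then be roughly:
    //   def persist(strategy: StorageStrategy): this.type
    //
    // e.g. turning each partition into a single pre-built buffer before caching:
    //   rdd.persist(new StorageStrategy(
    //     StorageLevel.MEMORY_ONLY_SER,
    //     serializer = Some(new KryoSerializer(conf)),
    //     transform  = Some(it => Iterator(toColumnarBuffer(it)))))
    //   (toColumnarBuffer is made up here)

Working directly on an InputStream or ByteBuffer could then just be a matter of 
letting the transform (or the serializer) hand back raw buffers instead of objects.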

One issue we've found with our current Serializer interface, by the way, is that 
a lot of type information is lost when you pass data to it, so the serializers 
spend a fair bit of time figuring out what class each object being written is. With 
this model, it would be possible for a serializer to know that all its data is 
of one type, which is pretty cool, but we might also consider ways of expanding 
the current Serializer interface to take more type information.
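
Just as an illustration of the kind of expansion I mean (again a hypothetical 
sketch, not the current interface), the stream could be opened with the element 
type supplied once, e.g. via a ClassTag:

    import java.io.OutputStream
    import scala.reflect.ClassTag

    // Hypothetical: the element class is supplied once when the stream is
    // opened, so the serializer doesn't have to rediscover it per object.
    trait TypedSerializerInstance {
      def serializeStream[T: ClassTag](out: OutputStream): TypedSerializationStream[T]
    }

    trait TypedSerializationStream[T] {
      def writeObject(obj: T): this.type  // every obj is known to be a T
      def flush(): Unit
      def close(): Unit
    }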

Matei

> On Nov 7, 2014, at 1:09 AM, Reynold Xin <r...@databricks.com> wrote:
> 
> Technically you can already use a custom serializer for each shuffle operation
> (it is part of ShuffledRDD). I've seen Matei suggest in the past, on JIRA issues
> (or GitHub), a "storage policy" in which you can specify how
> data should be stored. I think that would be a great API to have in the
> long run. Designing it won't be trivial, though.
> 
> 
> On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> 
>> Hey all,
>> 
>> Was messing around with Spark and Google FlatBuffers for fun, and it got me
>> thinking about Spark and serialization.  I know there's been work / talk
>> about in-memory columnar formats in Spark SQL, so maybe there are ways to
>> provide this flexibility already that I've missed?  Either way, my
>> thoughts:
>> 
>> Java and Kryo serialization are really nice in that they require almost no
>> extra work on the part of the user.  They can also represent complex object
>> graphs with cycles etc.
>> 
>> There are situations where other serialization frameworks are more
>> efficient:
>> * A Hadoop Writable-style format that delineates key-value boundaries and
>> allows for raw comparisons can greatly speed up some shuffle operations by
>> entirely avoiding deserialization until the object hits user code.
>> Writables also probably serialize / deserialize faster than Kryo.
>> * "No-deserialization" formats like FlatBuffers and Cap'n Proto address the
>> tradeoff between (1) Java objects that offer fast access but take lots of
>> space and stress GC and (2) Kryo-serialized buffers that are more compact
>> but take time to deserialize.
>> 
>> The drawbacks of these frameworks are that they require more work from the
>> user to define types, and that they're more restrictive in the reference
>> graphs they can represent.
>> 
>> In large applications, there are probably a few points where a
>> "specialized" serialization format is useful. But requiring Writables
>> everywhere because they're needed in a particularly intense shuffle is
>> cumbersome.
>> 
>> In light of that, would it make sense to enable varying Serializers within
>> an app? It could make sense to choose a serialization framework based both
>> on the objects being serialized and on what they're being serialized for
>> (caching vs. shuffle).  It might be possible to implement this underneath
>> the Serializer interface with some sort of multiplexing serializer that
>> chooses between subserializers.
>> 
>> Nothing urgent here, but curious to hear others' opinions.
>> 
>> -Sandy
>> 

