It looks like this was fixed in https://issues.apache.org/jira/browse/SPARK-4743 / https://github.com/apache/spark/pull/3605. Can you see whether that patch fixes this issue for you?
On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah <mch...@palantir.com> wrote:
> Hi everyone,
>
> I was using JavaPairRDD’s combineByKey() to compute all of my aggregations
> before, since I assumed that every aggregation required a key. However, I
> realized I could do my analysis using JavaRDD’s aggregate() instead and not
> use a key.
>
> I have set spark.serializer to use Kryo. As a result, JavaPairRDD’s
> combineByKey requires that a “createCombiner” function is provided, and the
> return value from that function must be serializable using Kryo. When I
> switched to using rdd.aggregate, I assumed that the zero value would also be
> strictly Kryo serialized, since it is a data item and not part of a closure
> or the aggregation functions. However, I got a serialization exception
> because the closure serializer (for which the only valid serializer is the
> Java serializer) was used instead.
>
> I was wondering the following:
>
> 1. What is the rationale for serializing the zero value with the closure
> serializer? It isn’t part of the closure, but is an initial data item.
> 2. Would it make sense to write a version of rdd.aggregate() that takes,
> as a parameter, a function that generates the zero value? It would be more
> intuitive for that function to be serialized with the closure serializer.
>
> I believe aggregateByKey is also affected.
>
> Thanks,
>
> -Matt Cheah
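
For anyone who wants to reproduce this on an affected build, below is a minimal sketch of the scenario Matt describes. The accumulator type (the hypothetical SumState) is registered with Kryo but does not implement java.io.Serializable, so passing it as the zero value to JavaRDD.aggregate() goes through the closure serializer and should fail with a NotSerializableException on versions without the SPARK-4743 patch.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical aggregation state: registered with Kryo only, not java.io.Serializable.
class SumState {
  long sum;
  SumState(long sum) { this.sum = sum; }
}

public class AggregateZeroValueRepro {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("aggregate-zero-value-repro")
        .setMaster("local[2]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(new Class<?>[] { SumState.class });
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // On affected versions the zero value below is serialized with the closure
    // serializer (plain Java serialization), so a Kryo-only type like SumState
    // throws NotSerializableException here, even though per-record data is
    // handled by Kryo just fine.
    SumState total = numbers.aggregate(
        new SumState(0L),
        (state, n) -> { state.sum += n; return state; },
        (a, b) -> { a.sum += b.sum; return a; });

    System.out.println("sum = " + total.sum);
    sc.stop();
  }
}

On a build that includes that patch, the zero value should instead be serialized with spark.serializer, and the same program should print sum = 15.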