It looks like this was fixed in https://issues.apache.org/jira/browse/SPARK-4743 / https://github.com/apache/spark/pull/3605. Can you see whether that patch fixes this issue for you?
On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah <mch...@palantir.com> wrote:
> Hi everyone,
>
> I was using JavaPairRDD’s combineByKey() to compute all of my aggregations
> before, since I assumed that every aggregation required a key. However, I
> realized I could do my analysis using JavaRDD’s aggregate() instead and not
> use a key.
>
> I have set spark.serializer to use Kryo. As a result, JavaPairRDD’s
> combineByKey requires that a “createCombiner” function is provided, and the
> return value from that function must be serializable using Kryo. When I
> switched to using rdd.aggregate, I assumed that the zero value would also be
> strictly Kryo serialized, since it is a data item and not part of a closure
> or the aggregation functions. However, I got a serialization exception
> because the closure serializer (for which the only valid serializer is the
> Java serializer) was used instead.
>
> I was wondering the following:
>
> 1. What is the rationale for serializing the zero value with the closure
> serializer? It isn’t part of the closure, but is an initial data item.
> 2. Would it make sense to write a version of rdd.aggregate() that takes,
> as a parameter, a function that generates the zero value? It would be more
> intuitive for that function to be serialized with the closure serializer.
>
> I believe aggregateByKey is also affected.
>
> Thanks,
>
> -Matt Cheah
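
For anyone who wants to reproduce this on an affected build, below is a minimal sketch of the scenario Matt describes. The accumulator type (the hypothetical SumState) is registered with Kryo but does not implement java.io.Serializable, so passing it as the zero value to JavaRDD.aggregate() goes through the closure serializer and should fail with a NotSerializableException on versions without the SPARK-4743 patch.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical aggregation state: registered with Kryo only, not java.io.Serializable.
class SumState {
  long sum;
  SumState(long sum) { this.sum = sum; }
}

public class AggregateZeroValueRepro {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("aggregate-zero-value-repro")
        .setMaster("local[2]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(new Class<?>[] { SumState.class });
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // On affected versions the zero value below is serialized with the closure
    // serializer (plain Java serialization), so a Kryo-only type like SumState
    // throws NotSerializableException here, even though per-record data is
    // handled by Kryo just fine.
    SumState total = numbers.aggregate(
        new SumState(0L),
        (state, n) -> { state.sum += n; return state; },
        (a, b) -> { a.sum += b.sum; return a; });

    System.out.println("sum = " + total.sum);
    sc.stop();
  }
}

On a build that includes that patch, the zero value should instead be serialized with spark.serializer, and the same program should print sum = 15.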