GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/2805#issuecomment-59626413
I can confirm that this seems to have fixed the serialization issue; here's
my test case:
```scala
import org.apache.spark.api.java._

// Build a JavaPairRDD and collect it into a java.util.Map on the driver
val pairs = sc.parallelize(1 to 10).map(x => (x, x))
val map = new JavaPairRDD(pairs).collectAsMap()

// Attempt Java serialization of an arbitrary object; throws
// NotSerializableException if anything in the object graph isn't serializable
def ser(a: AnyRef): Unit =
  new java.io.ObjectOutputStream(new java.io.ByteArrayOutputStream()).writeObject(a)

ser(map)
```
It looks like there's one more case in
`sql/core/src/main/scala/org/apache/spark/sql/api/java/Row.scala` that needs to
be addressed:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/java/Row.scala#L117.
This is a private method, but its return value flows to user code. I'll fix
this up myself on merge.
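For reference, one general way to fix this class of problem (a sketch of the
technique, not necessarily the exact change I'll apply; the
`toSerializableJavaMap` helper name is hypothetical) is to copy the Scala map
into a plain `java.util.HashMap` instead of returning a `JavaConversions` view:

```scala
import java.util.{HashMap => JHashMap}

// Hypothetical helper: rather than returning the view created by
// scala.collection.JavaConversions.mapAsJavaMap (a non-serializable
// Wrappers.MapWrapper), copy the entries into a plain java.util.HashMap,
// which implements java.io.Serializable.
def toSerializableJavaMap[K, V](m: scala.collection.Map[K, V]): java.util.Map[K, V] = {
  val result = new JHashMap[K, V](m.size)
  m.foreach { case (k, v) => result.put(k, v) }
  result
}
```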
There might still be other corner cases with serializability of
results that we haven't tested yet. The result of `collect()` is serializable,
so perhaps this issue only affected our use of MapWrapper. Long term, it would
be great to add a fuzz test that runs random Java API workloads and attempts to
serialize their results.
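A rough sketch of what such a fuzz test might look like (the `roundTrip`
helper, function names, and the fixed workload pool are all hypothetical; a
real fuzzer would compose random chains of transformations):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.api.java._

// Serialize and deserialize an object; throws NotSerializableException
// if any part of the object graph isn't Java-serializable.
def roundTrip(a: AnyRef): AnyRef = {
  val bytes = new ByteArrayOutputStream()
  new ObjectOutputStream(bytes).writeObject(a)
  new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray)).readObject()
}

def fuzzResultSerializability(sc: SparkContext, iterations: Int): Unit = {
  val jrdd = new JavaRDD(sc.parallelize(1 to 100))
  val pairs = new JavaPairRDD(sc.parallelize(1 to 100).map(x => (x, x)))
  // A fixed pool of result-producing Java API calls to sample from.
  val workloads: Seq[() => AnyRef] = Seq(
    () => jrdd.collect(),
    () => jrdd.take(5),
    () => pairs.collectAsMap(),
    () => pairs.groupByKey().collectAsMap()
  )
  for (_ <- 1 to iterations) {
    // Pick a random workload, run it, and check its result round-trips.
    roundTrip(workloads(Random.nextInt(workloads.size))())
  }
}
```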
I mentioned this over on JIRA, but for GitHub readers: I've opened an issue
to fix this upstream in Scala: https://issues.scala-lang.org/browse/SI-8911
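For anyone who wants to reproduce the underlying Scala issue without Spark,
here's a minimal standalone example:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.JavaConversions.mapAsJavaMap

// mapAsJavaMap returns a scala.collection.convert.Wrappers$MapWrapper,
// which does not implement java.io.Serializable:
val wrapped: java.util.Map[Int, Int] = mapAsJavaMap(Map(1 -> 1))
new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(wrapped)
// => java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper
```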
I'll merge this now with my fixup. Thanks!