GitHub user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2805#issuecomment-59626413
  
    I can confirm that this seems to have fixed the serialization issue; here's my test case:
    
    ```scala
    // Run in spark-shell, where `sc` is the predefined SparkContext
    import org.apache.spark.api.java._
    val pairs = sc.parallelize(1 to 10).map(x => (x, x))
    val map = new JavaPairRDD(pairs).collectAsMap()
    def ser(a: AnyRef) =
        (new java.io.ObjectOutputStream(new java.io.ByteArrayOutputStream())).writeObject(a)
    ser(map)
    ```
    
    It looks like there's one more case in `sql/core/src/main/scala/org/apache/spark/sql/api/java/Row.scala` that needs to be addressed: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/java/Row.scala#L117. This is a private method, but its return value flows to user code. I'll fix this up myself on merge.
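
    For readers curious what that fixup might look like: here's a minimal sketch of one way to make the returned map serializable, copying the entries into a plain `java.util.HashMap` instead of returning the `MapWrapper` produced by `scala.collection.JavaConversions`. The helper name is hypothetical, and the actual fixup may take a different approach:

    ```scala
    import java.util.{HashMap => JHashMap}

    // Hypothetical helper: copy a Scala map into a plain java.util.HashMap
    // so the value that reaches user code is Java-serializable, rather than
    // a (non-serializable) MapWrapper.
    def toSerializableJavaMap[K, V](m: scala.collection.Map[K, V]): JHashMap[K, V] = {
      val result = new JHashMap[K, V]()
      m.foreach { case (k, v) => result.put(k, v) }
      result
    }
    ```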
    
    There might still be other corner cases in the serializability of results that we haven't tested yet. The result of `collect()` is serializable, so perhaps this issue only affected our use of `MapWrapper`. Long term, it would be great to add a fuzz test that runs random Java API workloads and attempts to serialize their results.
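
    Here's a minimal sketch of the round-trip assertion such a fuzz test could use (not code from this PR; `assertSerializable` is a made-up name):

    ```scala
    import java.io._

    // Round-trip an object through Java serialization; throws
    // NotSerializableException (or similar) if any class in the
    // object graph is not serializable.
    def assertSerializable(obj: AnyRef): Unit = {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(obj)
      out.close()
      val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      in.readObject()
      in.close()
    }
    ```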
    
    I mentioned this over on JIRA, but for GitHub readers: I've opened an issue to fix this upstream in Scala: https://issues.scala-lang.org/browse/SI-8911
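
    For context, the underlying problem is that the wrappers returned by `scala.collection.JavaConversions` aren't marked `Serializable`. A minimal, Spark-free reproduction on an affected Scala version looks like this:

    ```scala
    import scala.collection.JavaConversions.mapAsJavaMap
    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // mapAsJavaMap wraps the Scala map in a MapWrapper; on Scala versions
    // without the SI-8911 fix, serializing it throws NotSerializableException.
    val wrapped: java.util.Map[Int, String] = mapAsJavaMap(Map(1 -> "a"))
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(wrapped)
    ```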
    
    I'll merge this now with my fixup.  Thanks!


