HyukjinKwon opened a new pull request #28648:
URL: https://github.com/apache/spark/pull/28648


   ### What changes were proposed in this pull request?
   
   This PR manually specifies the class for the input array being used in 
`(SparkContext|StreamingContext).union`. It fixes a regressionintroduced from 
SPARK-25737.
   
   ```python
   rdd1 = sc.parallelize([1,2,3,4,5])
   rdd2 = sc.parallelize([6,7,8,9,10])
   pairRDD1 = rdd1.zip(rdd2)
   sc.union([pairRDD1, pairRDD1]).collect()
   ```
   
   in the current master and `branch-3.0`:
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/context.py", line 870, in union
       jrdds[i] = rdds[i]._jrdd
     File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", 
line 238, in __setitem__
     File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", 
line 221, in __set_item
     File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value
   py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
   py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to 
org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
   ```
   
   which works in Spark 2.4.5:
   
   ```
   [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), 
(5, 10)]
   ```
   
   It assumed the class of the input array is the same; however, that can be 
different.
   
   This fix is based on @redsanket's initial approach, and will be co-authored.
   
   ### Why are the changes needed?
   
   To fix a regression from Spark 2.4.5.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's only in unreleased branches. This is to fix a regression.
   
   ### How was this patch tested?
   
   Manually tested, and a unittest was added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to