HyukjinKwon opened a new pull request #28648:
URL: https://github.com/apache/spark/pull/28648
### What changes were proposed in this pull request?
This PR manually specifies the class for the input array being used in
`(SparkContext|StreamingContext).union`. It fixes a regressionintroduced from
SPARK-25737.
```python
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([6,7,8,9,10])
pairRDD1 = rdd1.zip(rdd2)
sc.union([pairRDD1, pairRDD1]).collect()
```
in the current master and `branch-3.0`:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/context.py", line 870, in union
jrdds[i] = rdds[i]._jrdd
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
line 238, in __setitem__
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
line 221, in __set_item
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line
332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to
org.apache.spark.api.java.JavaRDD
at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
```
which works in Spark 2.4.5:
```
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9),
(5, 10)]
```
It assumed the class of the input array is the same; however, that can be
different.
This fix is based on @redsanket's initial approach, and will be co-authored.
### Why are the changes needed?
To fix a regression from Spark 2.4.5.
### Does this PR introduce _any_ user-facing change?
No, it's only in unreleased branches. This is to fix a regression.
### How was this patch tested?
Manually tested, and a unittest was added.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]