redsanket opened a new pull request #28603:
URL: https://github.com/apache/spark/pull/28603


   ### What changes were proposed in this pull request?
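   Calling `sc.union` on a list of pair RDDs (for example, RDDs produced by 
`zip`) fails in PySpark, because a `JavaPairRDD` cannot be stored in an array 
typed as `JavaRDD`. The fix is to check the instance type of the underlying 
Java RDD before proceeding.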
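
The idea of checking the instance type first can be sketched in plain Python, without Spark or a JVM. Everything below is a hypothetical stand-in written for illustration: the three classes only borrow the names of the Java classes, and `TypedArray`, `pick_element_class`, and `build_array` are assumed helper names, not part of PySpark or py4j. It is a minimal sketch of the dispatch pattern, not the code in this patch.

```python
class JavaRDD:
    """Hypothetical stand-in for org.apache.spark.api.java.JavaRDD."""


class JavaPairRDD:
    """Hypothetical stand-in for org.apache.spark.api.java.JavaPairRDD."""


class JavaDoubleRDD:
    """Hypothetical stand-in for org.apache.spark.api.java.JavaDoubleRDD."""


class TypedArray:
    """Mimics a typed JVM array: assignments of the wrong class are rejected."""

    def __init__(self, element_cls, length):
        self.element_cls = element_cls
        self.items = [None] * length

    def __setitem__(self, index, value):
        if not isinstance(value, self.element_cls):
            raise TypeError(
                "Cannot convert %s to %s"
                % (type(value).__name__, self.element_cls.__name__)
            )
        self.items[index] = value


def pick_element_class(first_handle):
    """Choose the array element class from the type of the first handle."""
    for cls in (JavaRDD, JavaPairRDD, JavaDoubleRDD):
        if isinstance(first_handle, cls):
            return cls
    raise TypeError("Unsupported RDD handle: %s" % type(first_handle).__name__)


def build_array(handles):
    """Build a typed array whose element class matches the handles."""
    element_cls = pick_element_class(handles[0])
    array = TypedArray(element_cls, len(handles))
    for i, handle in enumerate(handles):
        array[i] = handle
    return array
```

With this sketch, `build_array([JavaPairRDD(), JavaPairRDD()])` produces an array typed for the pair class, whereas assigning a pair handle into `TypedArray(JavaRDD, 1)` raises a conversion error similar in spirit to the one in the traceback below.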
   
   ### Why are the changes needed?
   Without the fix, `sc.union` raises a `Py4JError` when given pair RDDs. 
This can be reproduced in the PySpark shell:
   
   SparkSession available as 'spark'.
   >>> rdd1 = sc.parallelize([1,2,3,4,5])
   >>> rdd2 = sc.parallelize([6,7,8,9,10])
   >>> pairRDD1 = rdd1.zip(rdd2)
   >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
       jrdds[i] = rdds[i]._jrdd
     File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
     File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
     File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
   py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
   py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
       at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
       at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
       at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
       at py4j.GatewayConnection.run(GatewayConnection.java:238)
       at java.lang.Thread.run(Thread.java:748)
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Tested manually with the reproduction code above. After the patch, the 
union succeeds:
   
   >>> rdd1 = sc.parallelize([1,2,3,4,5])
   >>> rdd2 = sc.parallelize([6,7,8,9,10])
   >>> pairRDD1 = rdd1.zip(rdd2)
   >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
   >>> unionRDD1.collect()
   [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


