EnricoMi opened a new pull request, #38036:
URL: https://github.com/apache/spark/pull/38036

   Cogrouping two grouped DataFrames in PySpark whose groupings use different numbers of keys raises an error that is not very descriptive:
   
   ```
   py4j.protocol.Py4JJavaError: An error occurred while calling o726.collectToPython.
   : java.lang.IndexOutOfBoundsException: 1
        at scala.collection.mutable.ResizableArray.apply(ResizableArray.scala:46)
        at scala.collection.mutable.ResizableArray.apply$(ResizableArray.scala:45)
        at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:49)
        at org.apache.spark.sql.catalyst.plans.physical.HashShuffleSpec.$anonfun$createPartitioning$5(partitioning.scala:650)
   ...
   org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$14(EnsureRequirements.scala:159)
   ```
   
   ### What changes were proposed in this pull request?
   Assert that both sides of a cogroup group by the same number of keys, and raise a meaningful error when they do not.
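
   The proposed check can be sketched as follows. Note that `check_cogroup_keys` is a hypothetical, illustrative name, not Spark's actual internal API; the sketch only shows the shape of the validation and the new error message described below:

   ```python
   # Hypothetical sketch of the proposed validation (check_cogroup_keys is
   # an illustrative name, not Spark's internal code): both sides of a
   # cogroup must group by the same number of keys.
   def check_cogroup_keys(left_keys, right_keys):
       assert len(left_keys) == len(right_keys), \
           "group keys must have same size"

   check_cogroup_keys(["id"], ["id"])           # same cardinality: passes
   try:
       check_cogroup_keys(["id", "k"], ["id"])  # mismatch: raises
   except AssertionError as e:
       print(e)  # group keys must have same size
   ```

   With this check in place the user sees the descriptive `AssertionError` up front instead of the `IndexOutOfBoundsException` from deep inside the physical planner.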
   
   ### Why are the changes needed?
   The current error gives the user no hint about the cause of the problem or how to solve it.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, raises an `AssertionError: group keys must have same size` instead.
   
   ### How was this patch tested?
   Adds test `test_different_group_key_cardinality` to `pyspark.sql.tests.test_pandas_cogrouped_map`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

