RazoEtsy commented on issue #15119: URL: https://github.com/apache/iceberg/issues/15119#issuecomment-3801941196
hey :wave-ralph:, similar issue here. I'm trying to merge 3 datasets and it is failing with a similar error:

`java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(30000, 48882)`

All 3 datasets are bucketed by the same key and have the same number of files. I noticed that `SparkPartitioningAwareScan` was reporting more partitions for one of the tables than for the other two:

`SparkPartitioningAwareScan: Reporting KeyGroupedPartitioning by [identity(foo), identity(bar)] with 30000 partition(s) for table baz`

`SparkPartitioningAwareScan: Reporting KeyGroupedPartitioning by [identity(foo), identity(bar)] with 30000 partition(s) for table quux`

`SparkPartitioningAwareScan: Reporting KeyGroupedPartitioning by [identity(foo), identity(bar)] with 48882 partition(s) for table garply`

On disk, all three datasets have the same number of files, but one contains significantly more data. My hunch is that the logical partitioning is the source of the problem, but I don't know how to prove it :melting_face:

In case it helps for debugging, I successfully ran a 3-dataset SPJ join with similar data size.
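One rough way to confirm which table disagrees is to pull the reported partition counts out of the driver log and group the tables by count. The snippet below is only a sketch: the regular expression is an assumption based on the three `SparkPartitioningAwareScan` lines quoted in this comment (other Iceberg versions may word the message differently), and the helper names `reported_partitions` and `mismatched_tables` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed log format, taken from the lines quoted in this comment.
LOG_PATTERN = re.compile(
    r"Reporting KeyGroupedPartitioning by \[(?P<keys>[^\]]*)\] "
    r"with (?P<count>\d+) partition\(s\) for table (?P<table>\S+)"
)


def reported_partitions(log_text):
    """Map each table name to (grouping keys, reported partition count)."""
    return {
        m.group("table"): (m.group("keys"), int(m.group("count")))
        for m in LOG_PATTERN.finditer(log_text)
    }


def mismatched_tables(reports):
    """Group tables by reported count; return {} when every table agrees."""
    by_count = defaultdict(list)
    for table, (_keys, count) in reports.items():
        by_count[count].append(table)
    return dict(by_count) if len(by_count) > 1 else {}


# The three log lines from this comment, used as sample input.
sample_log = """
SparkPartitioningAwareScan: Reporting KeyGroupedPartitioning by [identity(foo), identity(bar)] with 30000 partition(s) for table baz
SparkPartitioningAwareScan: Reporting KeyGroupedPartitioning by [identity(foo), identity(bar)] with 30000 partition(s) for table quux
SparkPartitioningAwareScan: Reporting KeyGroupedPartitioning by [identity(foo), identity(bar)] with 48882 partition(s) for table garply
"""

reports = reported_partitions(sample_log)
print(mismatched_tables(reports))
```

This only shows which scan reports a different count. It does not explain why `garply` ends up with more key groups than the other two tables.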
