Github user pnpranavrao commented on the issue:
https://github.com/apache/spark/pull/21460
The reason `createBucketedReadRDD` works the way it currently does (accumulating all files for a given bucket into a single RDD partition whose `id` equals the `bucketId`) is to skip the shuffle when joining on the bucketed columns. Your suggested change would break this.
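To illustrate the behaviour, here is a minimal sketch (table names, column names, and the bucket count are hypothetical; assumes local mode): with both sides bucketed into the same number of buckets on the join key, the join is planned without an `Exchange` on either side.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Write two tables bucketed into 8 buckets on the same join key.
Seq((1, "a"), (2, "b")).toDF("key", "v1")
  .write.bucketBy(8, "key").sortBy("key").saveAsTable("t1")
Seq((1, "x"), (2, "y")).toDF("key", "v2")
  .write.bucketBy(8, "key").sortBy("key").saveAsTable("t2")

// Because each scan partition holds exactly one bucket (partition id ==
// bucketId), the join on `key` needs no Exchange on either side.
spark.table("t1").join(spark.table("t2"), "key").explain()
```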
I opened this [JIRA](https://issues.apache.org/jira/browse/SPARK-23442) for the same need. We need to generate different physical plans for join and non-join use cases, rather than assuming, as we do now, that a bucketed datasource will always be used in a join on the bucketed columns.
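To make the non-join case concrete (reusing the hypothetical `t1` from the sketch above): a plain scan gains nothing from the one-partition-per-bucket layout, yet is still planned that way, capping parallelism at `numBuckets`.

```scala
// Non-join usage: this scan does not need co-clustering on `key`, but
// with bucketing enabled it is still read as exactly 8 partitions.
val scanned = spark.table("t1").filter($"v1" === "a")
println(scanned.rdd.getNumPartitions)  // 8, one per bucket
```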
The workaround right now is to turn off bucketing with a SparkConf flag, but I'm working on using both the partitioning and bucketing information to plan queries.
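For reference, the flag in question is `spark.sql.sources.bucketing.enabled`; a minimal sketch of the workaround:

```scala
// Disabling bucketing makes Spark read the files as a plain
// (non-bucketed) datasource, so scans are no longer forced into
// numBuckets partitions (at the cost of shuffling on joins).
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
```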