GitHub user pnpranavrao commented on the issue:

    https://github.com/apache/spark/pull/21460
  
    The reason `createBucketedReadRDD` works the way it currently does 
(accumulating the files of each bucket, across all table partitions, into the 
single RDD partition whose `id` equals the `bucketId`) is to skip the shuffle 
when joining on the bucketed columns. Your suggested change would break this. 
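
    To make the shuffle-avoidance concrete, here is a minimal sketch (the 
table names `t1`/`t2` and the bucket count are made up for illustration, and 
`spark` is the usual spark-shell SparkSession): because RDD partition `i` 
holds exactly the rows of bucket `i` on both sides, the join needs no 
`Exchange`.

```scala
// Write two tables bucketed identically on the join key (hypothetical names).
spark.range(1000000).write.bucketBy(8, "id").sortBy("id").saveAsTable("t1")
spark.range(1000000).write.bucketBy(8, "id").sortBy("id").saveAsTable("t2")

// Both scans report the same HashPartitioning on "id" with 8 partitions,
// so the planner can place the SortMergeJoin directly on top of the scans
// with no Exchange in the plan.
spark.table("t1").join(spark.table("t2"), "id").explain()
```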
    
    I opened this [JIRA](https://issues.apache.org/jira/browse/SPARK-23442) for 
the same need. We need to generate different physical plans for join and 
non-join use cases, rather than assuming, as we do now, that a bucketed 
datasource will always be used in a join on the bucketed columns. 
    
    The workaround right now is to turn off bucketing with a SparkConf flag, 
but I'm working on using both the partitioning and bucketing information to 
plan queries.   
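
    For reference, a sketch of that workaround (assuming the flag in question 
is `spark.sql.sources.bucketing.enabled`; disabling it makes Spark ignore the 
bucketing metadata, which reintroduces the shuffle on joins):

```scala
// Disable bucketed reads session-wide; the table is then scanned like an
// ordinary file source, one partition per file split.
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
```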

