GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/21460
[SPARK-23442][SQL] Improve reading from partitioned and bucketed tables.
## What changes were proposed in this pull request?
For a partitioned and bucketed table, the amount of data grows as the number
of partitions increases, yet reading the table always uses only `bucket number`
tasks.
This PR changes the read parallelism to `bucket number` * `partition number`
when reading a partitioned and bucketed table.
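To make the intended change concrete, here is a minimal sketch of the parallelism arithmetic; the function name and parameters below are illustrative, not the actual Spark internals:
```scala
// Illustrative sketch only, not the actual Spark scan-planning code.
//   before: numTasks = numBuckets
//   after:  numTasks = numBuckets * numSelectedPartitions
def scanParallelism(numBuckets: Int, numSelectedPartitions: Int): Int =
  numBuckets * numSelectedPartitions

scanParallelism(5, 10) // 50 tasks when all 10 partitions are read
scanParallelism(5, 5)  // 25 tasks when a filter keeps half of them
```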
## How was this patch tested?
Manual tests:
```scala
import org.apache.spark.sql.functions.col

spark.range(10000)
  .selectExpr("id as key", "id % 5 as t1", "id % 10 as p")
  .repartition(5, col("p"))
  .write
  .partitionBy("p")
  .bucketBy(5, "key")
  .sortBy("t1")
  .saveAsTable("spark_23442")
```
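This produces 10 partition directories (p=0 through p=9), each holding 5 bucket files, i.e. 5 * 10 = 50 bucketed files in total. As a sanity check, the partition list can be inspected with:
```scala
// List the 10 partition directories created above (p=0 ... p=9);
// each contains 5 bucket files, giving 5 * 10 = 50 bucketed files.
spark.sql("show partitions spark_23442").show()
```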
```scala
// Read all partitions: task count = 5 buckets * 10 partitions = 50
spark.sql("select count(distinct t1) from spark_23442").show
```
```scala
// Filter out half of the partitions: task count = 5 buckets * (10 / 2) partitions = 25
spark.sql("select count(distinct t1) from spark_23442 where p >= 5").show
```
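One way to observe the scan parallelism directly is the partition count of the table's underlying RDD (a sketch; the expected values assume this PR's change is applied, while without it both reads would report only the bucket count, 5):
```scala
// Scan parallelism with the proposed change applied:
spark.table("spark_23442").rdd.getNumPartitions                 // expected: 50
spark.table("spark_23442").where("p >= 5").rdd.getNumPartitions // expected: 25
```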
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-23442
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21460.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21460
----
commit 58e4e098016051f41103464040ba24bbee28b2cf
Author: Yuming Wang <yumwang@...>
Date: 2018-05-30T06:53:52Z
Improvement reading from partitioned and bucketed table.
----