juechen507 commented on PR #30003:
URL: https://github.com/apache/spark/pull/30003#issuecomment-1370825133
Can a Hive bucketed table written by Spark be read back by spark-sql with bucket filter pruning and bucketed join/group-by? In my test, the bucket information is not used for either group-by or join.
The DDL is:

set spark.sql.hive.convertMetastoreParquet=true;
set spark.sql.hive.convertMetastoreOrc=true;

CREATE TABLE IF NOT EXISTS tmp.hive_bucketed_table3 (i int, j string)
PARTITIONED BY (k string)
CLUSTERED BY (i) SORTED BY (i) INTO 8 BUCKETS
STORED AS orc;
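For context, the table was populated from Spark SQL before running the queries below; a minimal sketch of that step, assuming a placeholder source table tmp.src with matching columns:

-- Hypothetical population step; tmp.src is a placeholder, not a real table.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE tmp.hive_bucketed_table3 PARTITION (k)
SELECT i, j, k FROM tmp.src;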
hive> desc formatted tmp.hive_bucketed_table3;
# col_name data_type comment
i int
j string
# Partition Information
# col_name data_type comment
k string
# Detailed Table Information
Database: tmp
Owner: hadoop-query
CreateTime: Wed Jan 04 19:32:59 CST 2023
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: (...)
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.create.version 3.3.2-SNAPSHOT
spark.sql.sources.schema {"type":"struct","fields":[{"name":"i","type":"integer","nullable":true,"metadata":{}},{"name":"j","type":"string","nullable":true,"metadata":{}},{"name":"k","type":"string","nullable":true,"metadata":{}}]}
spark.sql.sources.schema.bucketCol.0 i
spark.sql.sources.schema.numBucketCols 1
spark.sql.sources.schema.numBuckets 8
spark.sql.sources.schema.numPartCols 1
spark.sql.sources.schema.numSortCols 1
spark.sql.sources.schema.partCol.0 k
spark.sql.sources.schema.sortCol.0 i
transient_lastDdlTime 1672831979
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 8
Bucket Columns: [i]
Sort Columns: [Order(col:i, order:1)]
Storage Desc Params:
serialization.format 1
spark-sql> explain select i, count(*) from tmp.hive_bucketed_table3 group by 1;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[k#28, i#26], functions=[count(1)])
   +- Exchange hashpartitioning(k#28, i#26, 200), ENSURE_REQUIREMENTS, [plan_id=32]
      +- HashAggregate(keys=[k#28, i#26], functions=[partial_count(1)])
         +- FileScan orc tmp.hive_bucketed_table3[i#26,k#28] Batched: true, DataFilters: [], Format: ORC...
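The Exchange hashpartitioning above shows the shuffle is not avoided even though i is a bucket column. For comparison, here is a minimal sketch with a Spark-native bucketed table (the table name is hypothetical, not from my test) that should plan the same aggregation without the Exchange, assuming spark.sql.sources.bucketing.enabled is left at its default of true:

-- Hypothetical Spark-native bucketed table, for comparison only.
CREATE TABLE tmp.spark_bucketed_table (i int, j string, k string)
USING orc
PARTITIONED BY (k)
CLUSTERED BY (i) SORTED BY (i) INTO 8 BUCKETS;

explain select i, count(*) from tmp.spark_bucketed_table group by i;
-- Expected: HashAggregate directly over the FileScan, with no Exchange node.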
spark-sql> explain select * from tmp.hive_bucketed_table3 a join tmp.hive_bucketed_table3 b on a.i=b.i limit 1000;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 1000
   +- SortMergeJoin [i#50, k#52], [i#53, k#55], Inner
      :- Sort [i#50 ASC NULLS FIRST, k#52 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(i#50, k#52, 200), ENSURE_REQUIREMENTS, [plan_id=72]
      :     +- Filter isnotnull(i#50)
      :        +- FileScan orc tmp.hive_bucketed_table3[i#50,j#51,k#52] Batched: true, DataFilters: [isnotnull(i#50)], Format: ORC...
      +- Sort [i#53 ASC NULLS FIRST, k#55 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(i#53, k#55, 200), ENSURE_REQUIREMENTS, [plan_id=73]
            +- Filter isnotnull(i#53)
               +- FileScan orc tmp.hive_bucketed_table3[i#53,j#54,k#55] Batched: true, DataFilters: [isnotnull(i#53)], Format: ORC...
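Same story for the join: both Exchange nodes remain even though i is the bucket column on both sides. As a sketch, the equivalent self-join against the hypothetical Spark-native table above should plan without those Exchanges; the two configs set here only keep native bucketing on (true is the default) and force a sort-merge join:

set spark.sql.sources.bucketing.enabled=true;
set spark.sql.autoBroadcastJoinThreshold=-1;

explain select * from tmp.spark_bucketed_table a
join tmp.spark_bucketed_table b on a.i = b.i limit 1000;
-- Expected: SortMergeJoin without the two Exchange nodes.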