juechen507 commented on PR #30003:
URL: https://github.com/apache/spark/pull/30003#issuecomment-1370825133
Can a Hive bucketed table written by Spark be read back by spark-sql with bucket filter pruning and bucketed join/group-by? In my test, the bucket information is not used for either group-by or join.
The DDL is:

set spark.sql.hive.convertMetastoreParquet=true;
set spark.sql.hive.convertMetastoreOrc=true;

CREATE TABLE IF NOT EXISTS tmp.hive_bucketed_table3 (i int, j string)
PARTITIONED BY (k string)
CLUSTERED BY (i) SORTED BY (i) INTO 8 BUCKETS
STORED AS orc;
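For context, the table was populated from Spark SQL before running the queries below; a minimal sketch of that step, assuming a placeholder source table tmp.src with matching columns:

-- Hypothetical population step; tmp.src is a placeholder, not a real table.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE tmp.hive_bucketed_table3 PARTITION (k)
SELECT i, j, k FROM tmp.src;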
hive> desc formatted tmp.hive_bucketed_table3;
# col_name data_type comment
i int
j string
# Partition Information
# col_name data_type comment
k string
# Detailed Table Information
Database: tmp
Owner: hadoop-query
CreateTime: Wed Jan 04 19:32:59 CST 2023
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: (...)
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.create.version 3.3.2-SNAPSHOT
spark.sql.sources.schema {"type":"struct","fields":[{"name":"i","type":"integer","nullable":true,"metadata":{}},{"name":"j","type":"string","nullable":true,"metadata":{}},{"name":"k","type":"string","nullable":true,"metadata":{}}]}
spark.sql.sources.schema.bucketCol.0 i
spark.sql.sources.schema.numBucketCols 1
spark.sql.sources.schema.numBuckets 8
spark.sql.sources.schema.numPartCols 1
spark.sql.sources.schema.numSortCols 1
spark.sql.sources.schema.partCol.0 k
spark.sql.sources.schema.sortCol.0 i
transient_lastDdlTime 1672831979
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 8
Bucket Columns: [i]
Sort Columns: [Order(col:i, order:1)]
Storage Desc Params:
serialization.format 1
spark-sql> explain select i, count(*) from tmp.hive_bucketed_table3 group by 1;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[k#28, i#26], functions=[count(1)])
   +- Exchange hashpartitioning(k#28, i#26, 200), ENSURE_REQUIREMENTS, [plan_id=32]
      +- HashAggregate(keys=[k#28, i#26], functions=[partial_count(1)])
         +- FileScan orc tmp.hive_bucketed_table3[i#26,k#28] Batched: true, DataFilters: [], Format: ORC...
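The Exchange hashpartitioning above shows the shuffle is not avoided even though i is a bucket column. For comparison, here is a minimal sketch with a Spark-native bucketed table (the table name is hypothetical, not from my test) that should plan the same aggregation without the Exchange, assuming spark.sql.sources.bucketing.enabled is left at its default of true:

-- Hypothetical Spark-native bucketed table, for comparison only.
CREATE TABLE tmp.spark_bucketed_table (i int, j string, k string)
USING orc
PARTITIONED BY (k)
CLUSTERED BY (i) SORTED BY (i) INTO 8 BUCKETS;

explain select i, count(*) from tmp.spark_bucketed_table group by i;
-- Expected: HashAggregate directly over the FileScan, with no Exchange node.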
spark-sql> explain select * from tmp.hive_bucketed_table3 a join tmp.hive_bucketed_table3 b on a.i=b.i limit 1000;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 1000
   +- SortMergeJoin [i#50, k#52], [i#53, k#55], Inner
      :- Sort [i#50 ASC NULLS FIRST, k#52 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(i#50, k#52, 200), ENSURE_REQUIREMENTS, [plan_id=72]
      :     +- Filter isnotnull(i#50)
      :        +- FileScan orc tmp.hive_bucketed_table3[i#50,j#51,k#52] Batched: true, DataFilters: [isnotnull(i#50)], Format: ORC...
      +- Sort [i#53 ASC NULLS FIRST, k#55 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(i#53, k#55, 200), ENSURE_REQUIREMENTS, [plan_id=73]
            +- Filter isnotnull(i#53)
               +- FileScan orc tmp.hive_bucketed_table3[i#53,j#54,k#55] Batched: true, DataFilters: [isnotnull(i#53)], Format: ORC...
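Same story for the join: both Exchange nodes remain even though i is the bucket column on both sides. As a sketch, the equivalent self-join against the hypothetical Spark-native table above should plan without those Exchanges; the two configs set here only keep native bucketing on (true is the default) and force a sort-merge join:

set spark.sql.sources.bucketing.enabled=true;
set spark.sql.autoBroadcastJoinThreshold=-1;

explain select * from tmp.spark_bucketed_table a
join tmp.spark_bucketed_table b on a.i = b.i limit 1000;
-- Expected: SortMergeJoin without the two Exchange nodes.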