GitHub user tejasapatil commented on a diff in the pull request:
https://github.com/apache/spark/pull/17644#discussion_r116012486
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -871,6 +886,23 @@ private[hive] object HiveClientImpl {
       hiveTable.setViewOriginalText(t)
       hiveTable.setViewExpandedText(t)
     }
+
+    table.bucketSpec match {
+      case Some(bucketSpec) =>
+        hiveTable.setNumBuckets(bucketSpec.numBuckets)
+        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
--- End diff ---
I didn't get your point. Could you please elaborate?
With this PR, here is the behavior:
Bucketed table created using the DataFrame API:
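The `df` used in the transcript below is not shown in the original comment; purely for context, here is a minimal, hypothetical way a DataFrame with the same columns as the persisted schema (`i: int`, `j: int`, `k: string`) could be built:
```scala
// Hypothetical setup, not part of the original transcript: a small DataFrame
// with columns i (int), j (int), k (string), matching the schema shown below.
val df = spark.range(0, 16).selectExpr(
  "cast(id as int) as i",
  "cast(id % 4 as int) as j",
  "cast(id as string) as k")
```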
```
scala> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("spark_native1")
17/05/11 07:34:23 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`spark_native1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
scala> spark.sql(" desc formatted table1 ").collect.foreach(println)
[# col_name,data_type,comment]
[i,int,null]
[j,int,null]
[k,string,null]
[,,]
[# Detailed Table Information,,]
[Database,default,]
[Table,table1,]
[Owner,tejasp,]
[Created,Fri Apr 14 17:35:51 PDT 2017,]
[Last Access,Wed Dec 31 16:00:00 PST 1969,]
[Type,MANAGED,]
[Provider,org.apache.spark.sql.hive.orc.OrcFileFormat,]
[Num Buckets,8,]
[Bucket Columns,[`j`, `k`],]
[Sort Columns,[`j`, `k`],]
[Properties,[serialization.format=1],]
[Statistics,2938 bytes, 16 rows,]
[Location,file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/table1,]
[Serde Library,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat,org.apache.hadoop.mapred.SequenceFileInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,]
# Hive CLI
hive> desc formatted spark_native1 ;
OK
# col_name data_type comment
col array<string> from deserializer
# Detailed Table Information
Database: default
Owner: tejasp
CreateTime: Thu May 11 07:34:23 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/spark_native1
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
numFiles 8
numRows -1
rawDataSize -1
spark.sql.sources.provider org.apache.spark.sql.hive.orc.OrcFileFormat
spark.sql.sources.schema.bucketCol.0 j
spark.sql.sources.schema.bucketCol.1 k
spark.sql.sources.schema.numBucketCols 2
spark.sql.sources.schema.numBuckets 8
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.numSortCols 2
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"i\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"j\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"k\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
spark.sql.sources.schema.sortCol.0 j
spark.sql.sources.schema.sortCol.1 k
totalSize 2938
transient_lastDdlTime 1494513263
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/spark_native1
serialization.format 1
Time taken: 0.057 seconds, Fetched: 42 row(s)
```
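So on the Spark side the bucketing metadata survives (it is kept in the `spark.sql.sources.*` table properties shown above), while Hive itself reports the table as unbucketed (`Num Buckets: -1`). To double-check what Spark recorded, the bucket spec can be read back through the session catalog; a sketch using Spark's internal, unstable `sessionState` API, with the table name `spark_native1` from the transcript above:
```scala
// Sketch only: read back the bucket spec Spark persisted for the table created above.
import org.apache.spark.sql.catalyst.TableIdentifier

val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("spark_native1"))
println(meta.bucketSpec)  // should show a BucketSpec with 8 buckets on j, k sorted by j, k
```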
Bucketed table created via DDL:
```
scala> hc.sql("""
| CREATE TABLE `bucket_table_spark_created`(`user_id` int,`name` string)
| PARTITIONED BY (`ds` string)
| CLUSTERED BY (user_id)
| SORTED BY (user_id ASC)
| INTO 8 BUCKETS
| """)
res2: org.apache.spark.sql.DataFrame = []
scala> hc.sql(""" desc formatted bucket_table_spark_created
""").collect.foreach(println)
[# col_name,data_type,comment]
[user_id,int,null]
[name,string,null]
[ds,string,null]
[# Partition Information,,]
[# col_name,data_type,comment]
[ds,string,null]
[,,]
[# Detailed Table Information,,]
[Database,default,]
[Table,bucket_table_spark_created,]
[Owner,tejasp,]
[Created,Thu May 11 07:48:07 PDT 2017,]
[Last Access,Wed Dec 31 16:00:00 PST 1969,]
[Type,MANAGED,]
[Provider,hive,]
[Num Buckets,8,]
[Bucket Columns,[`user_id`],]
[Sort Columns,[`user_id`],]
[Properties,[serialization.format=1],]
[Location,file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/bucket_table_spark_created,]
[Serde Library,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat,org.apache.hadoop.mapred.TextInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,]
[Partition Provider,Catalog,]
# Hive CLI
hive> desc formatted bucket_table_spark_created ;
OK
# col_name data_type comment
user_id int
name string
# Partition Information
# col_name data_type comment
ds string
# Detailed Table Information
Database: default
Owner: tejasp
CreateTime: Thu May 11 07:48:07 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/bucket_table_spark_created
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.sources.schema.bucketCol.0 user_id
spark.sql.sources.schema.numBucketCols 1
spark.sql.sources.schema.numBuckets 8
spark.sql.sources.schema.numPartCols 1
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.numSortCols 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"user_id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ds\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
spark.sql.sources.schema.partCol.0 ds
spark.sql.sources.schema.sortCol.0 user_id
transient_lastDdlTime 1494514087
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 8
Bucket Columns: [user_id]
Sort Columns: [Order(col:user_id, order:1)]
Storage Desc Params:
serialization.format 1
Time taken: 1.169 seconds, Fetched: 41 row(s)
```
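For reference, the Hive-side fields in the last output (`Num Buckets: 8`, `Bucket Columns: [user_id]`, `Sort Columns: [Order(col:user_id, order:1)]`) come from the bucketing metadata on the Hive `Table` object, which is what the hunk in `HiveClientImpl` at the top populates from Spark's `BucketSpec`. A minimal, hypothetical sketch of that mapping (not the actual PR code), including sort columns, where Hive encodes ascending order as `1`:
```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.Order
import org.apache.hadoop.hive.ql.metadata.{Table => HiveTable}
import org.apache.spark.sql.catalyst.catalog.BucketSpec

// Illustrative helper only: copy a Spark BucketSpec onto a Hive metastore Table so
// Hive reports Num Buckets / Bucket Columns / Sort Columns as in the output above.
def applyBucketSpec(hiveTable: HiveTable, bucketSpec: BucketSpec): Unit = {
  hiveTable.setNumBuckets(bucketSpec.numBuckets)
  hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
  // order = 1 means ascending, matching Order(col:user_id, order:1) in the output above.
  hiveTable.setSortCols(bucketSpec.sortColumnNames.map(col => new Order(col, 1)).toList.asJava)
}
```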