GitHub user tejasapatil commented on a diff in the pull request:
https://github.com/apache/spark/pull/17644#discussion_r116012486
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -871,6 +886,23 @@ private[hive] object HiveClientImpl {
       hiveTable.setViewOriginalText(t)
       hiveTable.setViewExpandedText(t)
     }
+
+    table.bucketSpec match {
+      case Some(bucketSpec) =>
+        hiveTable.setNumBuckets(bucketSpec.numBuckets)
+        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
--- End diff ---
I didn't get your point. Could you please elaborate?
With this PR, here is the behavior:
Bucketed table created using the DataFrame API:
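The `df` used in the transcript below is not shown in the original comment; purely for context, here is a minimal, hypothetical way a DataFrame with the same columns as the persisted schema (`i: int`, `j: int`, `k: string`) could be built:
```scala
// Hypothetical setup, not part of the original transcript: a small DataFrame
// with columns i (int), j (int), k (string), matching the schema shown below.
val df = spark.range(0, 16).selectExpr(
  "cast(id as int) as i",
  "cast(id % 4 as int) as j",
  "cast(id as string) as k")
```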
```
scala> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("spark_native1")
17/05/11 07:34:23 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`spark_native1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
scala> spark.sql(" desc formatted table1 ").collect.foreach(println)
[# col_name,data_type,comment]
[i,int,null]
[j,int,null]
[k,string,null]
[,,]
[# Detailed Table Information,,]
[Database,default,]
[Table,table1,]
[Owner,tejasp,]
[Created,Fri Apr 14 17:35:51 PDT 2017,]
[Last Access,Wed Dec 31 16:00:00 PST 1969,]
[Type,MANAGED,]
[Provider,org.apache.spark.sql.hive.orc.OrcFileFormat,]
[Num Buckets,8,]
[Bucket Columns,[`j`, `k`],]
[Sort Columns,[`j`, `k`],]
[Properties,[serialization.format=1],]
[Statistics,2938 bytes, 16 rows,]
[Location,file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/table1,]
[Serde Library,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat,org.apache.hadoop.mapred.SequenceFileInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,]
# Hive CLI
hive> desc formatted spark_native1 ;
OK
# col_name data_type comment
col array<string> from deserializer
# Detailed Table Information
Database: default
Owner: tejasp
CreateTime: Thu May 11 07:34:23 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/spark_native1
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
numFiles 8
numRows -1
rawDataSize -1
spark.sql.sources.provider org.apache.spark.sql.hive.orc.OrcFileFormat
spark.sql.sources.schema.bucketCol.0 j
spark.sql.sources.schema.bucketCol.1 k
spark.sql.sources.schema.numBucketCols 2
spark.sql.sources.schema.numBuckets 8
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.numSortCols 2
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"i\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"j\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"k\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
spark.sql.sources.schema.sortCol.0 j
spark.sql.sources.schema.sortCol.1 k
totalSize 2938
transient_lastDdlTime 1494513263
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/spark_native1
serialization.format 1
Time taken: 0.057 seconds, Fetched: 42 row(s)
```
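So on the Spark side the bucketing metadata survives (it is kept in the `spark.sql.sources.*` table properties shown above), while Hive itself reports the table as unbucketed (`Num Buckets: -1`). To double-check what Spark recorded, the bucket spec can be read back through the session catalog; a sketch using Spark's internal, unstable `sessionState` API, with the table name `spark_native1` from the transcript above:
```scala
// Sketch only: read back the bucket spec Spark persisted for the table created above.
import org.apache.spark.sql.catalyst.TableIdentifier

val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("spark_native1"))
println(meta.bucketSpec)  // should show a BucketSpec with 8 buckets on j, k sorted by j, k
```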
Bucketed table created via DDL:
```
scala> hc.sql("""
| CREATE TABLE `bucket_table_spark_created`(`user_id` int,`name` string)
| PARTITIONED BY (`ds` string)
| CLUSTERED BY (user_id)
| SORTED BY (user_id ASC)
| INTO 8 BUCKETS
| """)
res2: org.apache.spark.sql.DataFrame = []
scala> hc.sql(""" desc formatted bucket_table_spark_created
""").collect.foreach(println)
[# col_name,data_type,comment]
[user_id,int,null]
[name,string,null]
[ds,string,null]
[# Partition Information,,]
[# col_name,data_type,comment]
[ds,string,null]
[,,]
[# Detailed Table Information,,]
[Database,default,]
[Table,bucket_table_spark_created,]
[Owner,tejasp,]
[Created,Thu May 11 07:48:07 PDT 2017,]
[Last Access,Wed Dec 31 16:00:00 PST 1969,]
[Type,MANAGED,]
[Provider,hive,]
[Num Buckets,8,]
[Bucket Columns,[`user_id`],]
[Sort Columns,[`user_id`],]
[Properties,[serialization.format=1],]
[Location,file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/bucket_table_spark_created,]
[Serde Library,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat,org.apache.hadoop.mapred.TextInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,]
[Partition Provider,Catalog,]
# Hive CLI
hive> desc formatted bucket_table_spark_created ;
OK
# col_name data_type comment
user_id int
name string
# Partition Information
# col_name data_type comment
ds string
# Detailed Table Information
Database: default
Owner: tejasp
CreateTime: Thu May 11 07:48:07 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/bucket_table_spark_created
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.sources.schema.bucketCol.0 user_id
spark.sql.sources.schema.numBucketCols 1
spark.sql.sources.schema.numBuckets 8
spark.sql.sources.schema.numPartCols 1
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.numSortCols 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"user_id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ds\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
spark.sql.sources.schema.partCol.0 ds
spark.sql.sources.schema.sortCol.0 user_id
transient_lastDdlTime 1494514087
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 8
Bucket Columns: [user_id]
Sort Columns: [Order(col:user_id, order:1)]
Storage Desc Params:
serialization.format 1
Time taken: 1.169 seconds, Fetched: 41 row(s)
```
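For reference, the Hive-side fields in the last output (`Num Buckets: 8`, `Bucket Columns: [user_id]`, `Sort Columns: [Order(col:user_id, order:1)]`) come from the bucketing metadata on the Hive `Table` object, which is what the hunk in `HiveClientImpl` at the top populates from Spark's `BucketSpec`. A minimal, hypothetical sketch of that mapping (not the actual PR code), including sort columns, where Hive encodes ascending order as `1`:
```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.Order
import org.apache.hadoop.hive.ql.metadata.{Table => HiveTable}
import org.apache.spark.sql.catalyst.catalog.BucketSpec

// Illustrative helper only: copy a Spark BucketSpec onto a Hive metastore Table so
// Hive reports Num Buckets / Bucket Columns / Sort Columns as in the output above.
def applyBucketSpec(hiveTable: HiveTable, bucketSpec: BucketSpec): Unit = {
  hiveTable.setNumBuckets(bucketSpec.numBuckets)
  hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
  // order = 1 means ascending, matching Order(col:user_id, order:1) in the output above.
  hiveTable.setSortCols(bucketSpec.sortColumnNames.map(col => new Order(col, 1)).toList.asJava)
}
```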