wangyum opened a new pull request #24486: [SPARK-27592][SQL] Write the data of 
table write information to metadata
URL: https://github.com/apache/spark/pull/24486
 
 
   ## What changes were proposed in this pull request?
   
   We hint Hive to use an incorrect **InputFormat** (`org.apache.hadoop.mapred.SequenceFileInputFormat`) to read Spark's **Parquet** data source bucketed table:
   ```sql
   spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
   2019-04-29 17:52:05 WARN  HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
   spark-sql> DESC EXTENDED t;
   c1   int     NULL
   c2   int     NULL
   
   # Detailed Table Information
   Database     default
   Table        t
   Owner        yumwang
   Created Time Mon Apr 29 17:52:05 CST 2019
   Last Access  Thu Jan 01 08:00:00 CST 1970
   Created By   Spark 2.4.0
   Type MANAGED
   Provider     parquet
   Num Buckets  2
   Bucket Columns       [`c1`]
   Sort Columns [`c1`]
   Table Properties     [transient_lastDdlTime=1556531525]
   Location     file:/user/hive/warehouse/t
   Serde Library        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
   InputFormat  org.apache.hadoop.mapred.SequenceFileInputFormat
   OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
   Storage Properties   [serialization.format=1]
   ```
   Spark logs a warning about the incompatibility when the table is created:
   ```
   WARN  HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
   ```
   But downstream engines have no way to know about this incompatibility. I'd like to write this table's write information to the metadata so that each engine can decide compatibility itself.
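
To illustrate the idea of letting each engine decide, here is a minimal sketch of a check a downstream engine might run against table metadata. It assumes the write information is exposed as plain key/value table properties; the property keys and function names below are illustrative assumptions, not necessarily what this patch writes.

```python
from typing import Mapping

# Assumed property keys, used here only for illustration.
PROVIDER_KEY = "spark.sql.sources.provider"
NUM_BUCKETS_KEY = "spark.sql.sources.schema.numBuckets"


def is_spark_specific_bucketed(table_props: Mapping[str, str]) -> bool:
    """Return True when the metadata marks a Spark data source bucketed table."""
    has_provider = PROVIDER_KEY in table_props
    num_buckets = int(table_props.get(NUM_BUCKETS_KEY, "0"))
    return has_provider and num_buckets > 0


def can_trust_declared_input_format(table_props: Mapping[str, str]) -> bool:
    """Hypothetical engine-side decision: do not rely on the declared
    InputFormat when the table was written in a Spark-specific layout."""
    return not is_spark_specific_bucketed(table_props)
```

With a check like this, an engine reading table `t` above could refuse to rely on the declared `SequenceFileInputFormat` instead of failing while reading Parquet files.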
   
   ## How was this patch tested?
   
   unit tests
   

