[
https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
pengzhiwei reassigned HUDI-1415:
--------------------------------
Assignee: pengzhiwei
> Incorrect query result for hudi table when using spark sql
> ----------------------------------------------------------
>
> Key: HUDI-1415
> URL: https://issues.apache.org/jira/browse/HUDI-1415
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Fix For: 0.6.1
>
>
> Currently, Hudi can sync table metadata to the Hive metastore using HiveSyncTool.
> The table definition synced to Hive looks like this:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_insert0`(
> `_hoodie_commit_time` string,
> `_hoodie_commit_seqno` string,
> `_hoodie_record_key` string,
> `_hoodie_partition_path` string,
> `_hoodie_file_name` string,
> `id` int,
> `name` string,
> `version` int,
> `dt` string)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'file:/tmp/hudi/tbl_price_insert0'
> TBLPROPERTIES (
> 'last_commit_time_sync'='20201124105009',
> 'transient_lastDdlTime'='1606186222')
> {code}
> When we query this table using Spark SQL, Spark SQL treats it as a Hive table
> and converts it to a parquet LogicalRelation in
> HiveStrategies#RelationConversions. This may lead to an incorrect query
> result.
> In order to query a Hudi table correctly in Spark SQL, more table properties and
> SerDe properties must be added to the Hive metastore, as follows:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_cow0`(
> `_hoodie_commit_time` string,
> `_hoodie_commit_seqno` string,
> `_hoodie_record_key` string,
> `_hoodie_partition_path` string,
> `_hoodie_file_name` string,
> `id` int,
> `name` string,
> `version` int)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
> 'path'='/tmp/hudi/tbl_price_cow0')
> STORED AS INPUTFORMAT
> 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'file:/tmp/hudi/tbl_price_cow0'
> TBLPROPERTIES (
> 'last_commit_time_sync'='20201124120532',
> 'spark.sql.sources.provider'='hudi',
> 'spark.sql.sources.schema.numParts'='1',
> 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
> 'transient_lastDdlTime'='1606190729')
> {code}
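The Spark-specific table properties in the second definition above could be generated roughly as follows. This is a minimal Python sketch for illustration only, not the HiveSyncTool implementation: the names `spark_schema_properties`, `restore_schema`, and `example_schema` are hypothetical, and the 4000-character chunk size is an assumption about how long schemas get split across `spark.sql.sources.schema.part.N` entries.

```python
import json

# Schema taken from the 'spark.sql.sources.schema.part.0' value quoted above.
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
        {"name": "price", "type": "double", "nullable": False, "metadata": {}},
        {"name": "version", "type": "integer", "nullable": False, "metadata": {}},
    ],
}


def spark_schema_properties(schema, provider="hudi", max_part_len=4000):
    """Build the table properties that mark a table as a Spark data source table.

    max_part_len is an assumed chunk size; the schema JSON is split into
    numbered parts so that no single property value exceeds it.
    """
    schema_json = json.dumps(schema, separators=(",", ":"))
    parts = [
        schema_json[i:i + max_part_len]
        for i in range(0, len(schema_json), max_part_len)
    ]
    props = {
        "spark.sql.sources.provider": provider,
        "spark.sql.sources.schema.numParts": str(len(parts)),
    }
    for idx, part in enumerate(parts):
        props[f"spark.sql.sources.schema.part.{idx}"] = part
    return props


def restore_schema(props):
    """Reassemble the schema from the numbered parts (the reader's side)."""
    num_parts = int(props["spark.sql.sources.schema.numParts"])
    schema_json = "".join(
        props[f"spark.sql.sources.schema.part.{i}"] for i in range(num_parts)
    )
    return json.loads(schema_json)


if __name__ == "__main__":
    for key, value in spark_schema_properties(example_schema).items():
        print(f"'{key}'='{value}',")
```

With `spark.sql.sources.provider` set to `hudi`, the intent described in the issue is that Spark SQL resolves the table through the Hudi data source rather than converting it to a plain parquet relation. The `path` SerDe property from the second definition would be set separately.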
--
This message was sent by Atlassian Jira
(v8.3.4#803005)