[
https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
pengzhiwei updated HUDI-1415:
-----------------------------
Issue Type: Improvement (was: Bug)
> Read Hoodie Table As Spark DataSource Table
> --------------------------------------------
>
> Key: HUDI-1415
> URL: https://issues.apache.org/jira/browse/HUDI-1415
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Affects Versions: 0.9.0
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Labels: pull-request-available, user-support-issues
> Fix For: 0.9.0
>
>
> If we update a Hudi table two or more times, we will get an incorrect query
> count from Spark SQL.
>
> Currently Hudi can sync the metadata to the Hive metastore using HiveSyncTool.
> The table description synced to Hive looks like this:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_insert0`(
> `_hoodie_commit_time` string,
> `_hoodie_commit_seqno` string,
> `_hoodie_record_key` string,
> `_hoodie_partition_path` string,
> `_hoodie_file_name` string,
> `id` int,
> `name` string,
> `price` double,
> `version` int,
> `dt` string)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'file:/tmp/hudi/tbl_price_insert0'
> TBLPROPERTIES (
> 'last_commit_time_sync'='20201124105009',
> 'transient_lastDdlTime'='1606186222')
> {code}
> When we query this table using Spark SQL, it treats it as a Hive table rather
> than a Spark data source table, and converts it to a parquet LogicalRelation
> in HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi
> table just like a plain parquet data source, which leads to an incorrect
> query result.
> In order to query a Hudi table correctly in Spark SQL, more table properties
> and serde properties must be added to the Hive metastore, as shown below:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_cow0`(
> `_hoodie_commit_time` string,
> `_hoodie_commit_seqno` string,
> `_hoodie_record_key` string,
> `_hoodie_partition_path` string,
> `_hoodie_file_name` string,
> `id` int,
> `name` string,
> `price` double,
> `version` int)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
> 'path'='/tmp/hudi/tbl_price_cow0')
> STORED AS INPUTFORMAT
> 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'file:/tmp/hudi/tbl_price_cow0'
> TBLPROPERTIES (
> 'last_commit_time_sync'='20201124120532',
> 'spark.sql.sources.provider'='hudi',
> 'spark.sql.sources.schema.numParts'='1',
> 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
> 'transient_lastDdlTime'='1606190729')
> {code}
> These are the missing table properties:
> {code:java}
> spark.sql.sources.provider = 'hudi'
> spark.sql.sources.schema.numParts = 'xx'
> spark.sql.sources.schema.part.{num} = 'xx'
> spark.sql.sources.schema.numPartCols = 'xx'
> spark.sql.sources.schema.partCol.{num} = 'xx'
> {code}
> and serde property:
> {code:java}
> 'path'='/path/to/hudi'
> {code}
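> For an existing table that was already synced without these properties, a
> rough sketch of the equivalent Hive DDL is shown below (the table name and
> path are taken from the example above; the schema JSON must match the actual
> table schema, and the exact statements the sync tool should issue may differ):
> {code:sql}
> -- Mark the table as a Spark data source table with provider 'hudi'
> -- (values here are illustrative, based on the example table above)
> ALTER TABLE `tbl_price_cow0` SET TBLPROPERTIES (
>   'spark.sql.sources.provider'='hudi',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}');
>
> -- Point the serde at the table base path so Spark can resolve the relation
> ALTER TABLE `tbl_price_cow0` SET SERDEPROPERTIES (
>   'path'='/tmp/hudi/tbl_price_cow0');
> {code}
> With these properties in place, Spark SQL should resolve the table through
> the 'hudi' data source provider instead of converting it to a plain parquet
> relation.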
--
This message was sent by Atlassian Jira
(v8.3.4#803005)