[jira] [Updated] (HUDI-1415) Incorrect query result for hudi table when using spark sql

pengzhiwei (Jira) Mon, 23 Nov 2020 22:05:34 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


pengzhiwei updated HUDI-1415:
-----------------------------
    Description: 
Currently, Hudi can sync table metadata to the Hive metastore using HiveSyncTool. 
The table definition synced to Hive looks like this:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_insert0`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `id` int, 
  `name` string, 
  `version` int, 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi/tbl_price_insert0'
TBLPROPERTIES (
  'last_commit_time_sync'='20201124105009', 
  'transient_lastDdlTime'='1606186222')
{code}
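When we query this table using Spark SQL, Spark SQL treats it as a Hive table 
and converts it to a Parquet LogicalRelation in 
HiveStrategies#RelationConversions. This may lead to an incorrect query result, 
because the converted relation reads the Parquet files directly rather than 
through HoodieParquetInputFormat.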
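
The conversion described above can be illustrated with a small model. This is a simplified sketch in Python, not Spark's actual code: the function names are hypothetical, and the rule it encodes (a metastore table with no spark.sql.sources.provider property is handled as a Hive table, and a Hive table with a Parquet serde is converted when spark.sql.hive.convertMetastoreParquet is enabled) is an assumption about how RelationConversions behaves.

```python
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"


def is_hive_table(tbl_properties):
    """Assumed rule: no recorded Spark provider means the table is a Hive table."""
    provider = tbl_properties.get("spark.sql.sources.provider")
    return provider is None or provider.lower() == "hive"


def converted_to_parquet_relation(serde, tbl_properties, convert_metastore_parquet=True):
    """Model of the conversion check: Hive table + Parquet serde + conversion enabled."""
    return (
        is_hive_table(tbl_properties)
        and "parquet" in serde.lower()
        and convert_metastore_parquet
    )


# Table as currently synced: only Hive-level properties are present.
synced_today = {"last_commit_time_sync": "20201124105009"}
# Table with the proposed provider property added.
with_provider = {"spark.sql.sources.provider": "hudi"}

print(converted_to_parquet_relation(PARQUET_SERDE, synced_today))
print(converted_to_parquet_relation(PARQUET_SERDE, with_provider))
```

Under this model, recording the provider is what keeps the table off the plain Parquet path; disabling the conversion setting would have the same effect for the current table definition.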

In order to query a Hudi table correctly in Spark SQL, more table properties and 
serde properties must be added to the Hive metadata, as follows:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_cow0`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `id` int, 
  `name` string, 
  `version` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='/tmp/hudi/tbl_price_cow0') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi/tbl_price_cow0'
TBLPROPERTIES (
  'last_commit_time_sync'='20201124120532', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}', 
  'transient_lastDdlTime'='1606190729')
{code}
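
For a table that has already been synced, one way to add such properties is through ALTER TABLE statements. The sketch below only renders the statement text; the helper names are hypothetical, and rejecting values that contain a single quote or backslash is a simplification made here so that no escaping rules have to be assumed.

```python
def _quote(value):
    # Simplification for this sketch: refuse values that would need escaping.
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in literal: {value!r}")
    return f"'{value}'"


def alter_statements(table, tbl_props, serde_props):
    """Render ALTER TABLE statements that set table and serde properties."""

    def render(props):
        return ", ".join(f"{_quote(k)}={_quote(v)}" for k, v in props.items())

    return [
        f"ALTER TABLE `{table}` SET TBLPROPERTIES ({render(tbl_props)})",
        f"ALTER TABLE `{table}` SET SERDEPROPERTIES ({render(serde_props)})",
    ]


for statement in alter_statements(
    "tbl_price_cow0",
    {"spark.sql.sources.provider": "hudi"},
    {"path": "/tmp/hudi/tbl_price_cow0"},
):
    print(statement)
```
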
These are the missing table properties:
{code:java}
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
{code}
And this is the missing serde property:
{code:java}
'path'='/path/to/hudi'
{code}
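
To make the layout of these properties concrete, here is a minimal sketch of how they could be derived from a schema JSON string and a list of partition columns. The function name is hypothetical, and the chunk size of 4000 characters is an assumption based on Spark's default for spark.sql.sources.schemaStringLengthThreshold; this is not the HiveSyncTool implementation.

```python
import json


def spark_table_properties(schema_json, partition_cols, provider="hudi", threshold=4000):
    """Build the Spark data source table properties for a schema JSON string."""
    # Split the schema into chunks so no single property value exceeds the threshold.
    parts = [schema_json[i:i + threshold] for i in range(0, len(schema_json), threshold)]
    props = {
        "spark.sql.sources.provider": provider,
        "spark.sql.sources.schema.numParts": str(len(parts)),
    }
    for i, part in enumerate(parts):
        props[f"spark.sql.sources.schema.part.{i}"] = part
    # Partition column entries are only written when the table is partitioned.
    if partition_cols:
        props["spark.sql.sources.schema.numPartCols"] = str(len(partition_cols))
        for i, col in enumerate(partition_cols):
            props[f"spark.sql.sources.schema.partCol.{i}"] = col
    return props


# Illustrative schema using the column names from the first table above.
schema = json.dumps(
    {
        "type": "struct",
        "fields": [
            {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
            {"name": "name", "type": "string", "nullable": True, "metadata": {}},
            {"name": "version", "type": "integer", "nullable": False, "metadata": {}},
            {"name": "dt", "type": "string", "nullable": True, "metadata": {}},
        ],
    },
    separators=(",", ":"),
)

for key, value in spark_table_properties(schema, ["dt"]).items():
    print(f"{key} = {value}")
```

Concatenating the part.{num} values in index order gives back the original schema string, which is why numParts has to be stored alongside them.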

  was:
Currently, Hudi can sync table metadata to the Hive metastore using HiveSyncTool. 
The table definition synced to Hive looks like this:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_insert0`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `id` int, 
  `name` string, 
  `version` int, 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi/tbl_price_insert0'
TBLPROPERTIES (
  'last_commit_time_sync'='20201124105009', 
  'transient_lastDdlTime'='1606186222')
{code}
When we query this table using Spark SQL, Spark SQL treats it as a Hive table 
and converts it to a Parquet LogicalRelation in 
HiveStrategies#RelationConversions. This may lead to an incorrect query result.

In order to query a Hudi table correctly in Spark SQL, more table properties and 
serde properties must be added to the Hive metadata, as follows:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_cow0`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `id` int, 
  `name` string, 
  `version` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='/tmp/hudi/tbl_price_cow0') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi/tbl_price_cow0'
TBLPROPERTIES (
  'last_commit_time_sync'='20201124120532', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}', 
  'transient_lastDdlTime'='1606190729')
{code}


> Incorrect query result for hudi table when using spark sql
> ----------------------------------------------------------
>
>                 Key: HUDI-1415
>                 URL: https://issues.apache.org/jira/browse/HUDI-1415
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>             Fix For: 0.6.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
