[
https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
pengzhiwei updated HUDI-1415:
-----------------------------
Description:
Currently Hudi can sync metadata to the Hive metastore using HiveSyncTool.
The table description synced to Hive looks like this:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_insert0`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` int,
`name` string,
`price` double,
`version` int,
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'file:/tmp/hudi/tbl_price_insert0'
TBLPROPERTIES (
'last_commit_time_sync'='20201124105009',
'transient_lastDdlTime'='1606186222')
{code}
When we query this table using Spark SQL, Spark treats it as a Hive table rather
than a Spark data source table and converts it to a parquet LogicalRelation in
HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table
just like a plain parquet data source. This leads to an incorrect query result if
the user has not set spark.sql.hive.convertMetastoreParquet=false.
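As a workaround today, the user has to disable this conversion manually. A minimal sketch, assuming a Hive-enabled Spark session and the table synced above (the session setup here is illustrative only):
{code:java}
import org.apache.spark.sql.SparkSession;

public class QueryHudiTable {
  public static void main(String[] args) {
    // Example only: a Hive-enabled Spark session.
    SparkSession spark = SparkSession.builder()
        .appName("query-hudi-table")
        .enableHiveSupport()
        .getOrCreate();

    // Without this setting, Spark rewrites the Hive table into a plain parquet
    // LogicalRelation and the query can return incorrect results.
    spark.conf().set("spark.sql.hive.convertMetastoreParquet", "false");

    spark.sql("SELECT count(*) FROM tbl_price_insert0").show();
  }
}
{code}
With the data source table properties described below, this extra setting should no longer be required.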
In order to query the Hudi table as a data source table in Spark, more table
properties and serde properties must be added to the Hive metastore, like the
following:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_cow0`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` int,
`name` string,
`price` double,
`version` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='/tmp/hudi/tbl_price_cow0')
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'file:/tmp/hudi/tbl_price_cow0'
TBLPROPERTIES (
'last_commit_time_sync'='20201124120532',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
'transient_lastDdlTime'='1606190729')
{code}
These are the missing table properties:
{code:java}
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
{code}
and the serde property:
{code:java}
'path'='/path/to/hudi'
{code}
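For illustration only, a sketch of where these properties live in the Hive metastore, using the Hive metastore client directly. The database name, base path, and schema JSON are placeholders, and this is not the actual HiveSyncTool change:
{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class TagHudiTableAsDataSource {
  public static void main(String[] args) throws Exception {
    // Example only: connect to the metastore configured in hive-site.xml.
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    Table table = client.getTable("default", "tbl_price_cow0");

    // Table properties that let Spark resolve the table through the 'hudi'
    // data source instead of converting it to a plain parquet relation.
    table.getParameters().put("spark.sql.sources.provider", "hudi");
    table.getParameters().put("spark.sql.sources.schema.numParts", "1");
    // Placeholder: the Spark SQL schema JSON (split into parts if long),
    // as shown in the TBLPROPERTIES example above.
    table.getParameters().put("spark.sql.sources.schema.part.0", "<schema-json>");

    // Serde property pointing Spark at the Hudi table base path.
    table.getSd().getSerdeInfo().getParameters()
        .put("path", "/tmp/hudi/tbl_price_cow0");

    client.alter_table("default", "tbl_price_cow0", table);
    client.close();
  }
}
{code}
In practice HiveSyncTool would write these properties during sync; the sketch only shows which metastore fields the missing table properties and the serde property map to.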
was:
If we update a Hudi table two or more times, we will get an incorrect query count
from Spark SQL.
Currently Hudi can sync metadata to the Hive metastore using HiveSyncTool.
The table description synced to Hive looks like this:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_insert0`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` int,
`name` string,
`price` double,
`version` int,
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'file:/tmp/hudi/tbl_price_insert0'
TBLPROPERTIES (
'last_commit_time_sync'='20201124105009',
'transient_lastDdlTime'='1606186222')
{code}
When we query this table using Spark SQL, Spark treats it as a Hive table rather
than a Spark data source table and converts it to a parquet LogicalRelation in
HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table
just like a plain parquet data source. This leads to an incorrect query result.
In order to query the Hudi table correctly in Spark SQL, more table properties and
serde properties must be added to the Hive metastore, like the following:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_cow0`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` int,
`name` string,
`price` double,
`version` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='/tmp/hudi/tbl_price_cow0')
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'file:/tmp/hudi/tbl_price_cow0'
TBLPROPERTIES (
'last_commit_time_sync'='20201124120532',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
'transient_lastDdlTime'='1606190729')
{code}
These are the missing table properties:
{code:java}
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
{code}
and the serde property:
{code:java}
'path'='/path/to/hudi'
{code}
> Read Hoodie Table As Spark DataSource Table
> --------------------------------------------
>
> Key: HUDI-1415
> URL: https://issues.apache.org/jira/browse/HUDI-1415
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Affects Versions: 0.9.0
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Labels: pull-request-available, user-support-issues
> Fix For: 0.9.0
>
>
> Currently Hudi can sync metadata to the Hive metastore using HiveSyncTool.
> The table description synced to Hive looks like this:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_insert0`(
> `_hoodie_commit_time` string,
> `_hoodie_commit_seqno` string,
> `_hoodie_record_key` string,
> `_hoodie_partition_path` string,
> `_hoodie_file_name` string,
> `id` int,
> `name` string,
> `price` double,
> `version` int,
> `dt` string)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'file:/tmp/hudi/tbl_price_insert0'
> TBLPROPERTIES (
> 'last_commit_time_sync'='20201124105009',
> 'transient_lastDdlTime'='1606186222')
> {code}
> When we query this table using Spark SQL, Spark treats it as a Hive table rather
> than a Spark data source table and converts it to a parquet LogicalRelation in
> HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table
> just like a plain parquet data source. This leads to an incorrect query result
> if the user has not set spark.sql.hive.convertMetastoreParquet=false.
> In order to query the Hudi table as a data source table in Spark, more table
> properties and serde properties must be added to the Hive metastore, like the
> following:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_cow0`(
> `_hoodie_commit_time` string,
> `_hoodie_commit_seqno` string,
> `_hoodie_record_key` string,
> `_hoodie_partition_path` string,
> `_hoodie_file_name` string,
> `id` int,
> `name` string,
> `price` double,
> `version` int)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
> 'path'='/tmp/hudi/tbl_price_cow0')
> STORED AS INPUTFORMAT
> 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'file:/tmp/hudi/tbl_price_cow0'
> TBLPROPERTIES (
> 'last_commit_time_sync'='20201124120532',
> 'spark.sql.sources.provider'='hudi',
> 'spark.sql.sources.schema.numParts'='1',
> 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
> 'transient_lastDdlTime'='1606190729')
> {code}
> These are the missing table properties:
> {code:java}
> spark.sql.sources.provider = 'hudi'
> spark.sql.sources.schema.numParts = 'xx'
> spark.sql.sources.schema.part.{num} = 'xx'
> spark.sql.sources.schema.numPartCols = 'xx'
> spark.sql.sources.schema.partCol.{num} = 'xx'
> {code}
> and the serde property:
> {code:java}
> 'path'='/path/to/hudi'
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)