[jira] [Updated] (HUDI-5584) When the table to be synchronized already exists in hive, need to update serde/table properties

HunterXHunter (Jira) Wed, 18 Jan 2023 22:21:07 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


HunterXHunter updated HUDI-5584:
--------------------------------
    Description: 
When we set hoodie.datasource.hive_sync.table.strategy='ro', we expect only one 
table to be synchronized to Hive, without the _ro suffix.

However, sometimes the table has already been created in Hive beforehand,

for example:
{code:java}
create table hive.test.HUDI_5584 (
  id int,
  ts int)
using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.hive_sync.enable = 'true',
  hoodie.datasource.hive_sync.table.strategy = 'ro'
) location '/tmp/HUDI_5584'
{code}
Running SHOW CREATE TABLE on it then returns:
{code:java}
CREATE EXTERNAL TABLE `hudi_5584`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `ts` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='file:///tmp/HUDI_5584')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/HUDI_5584'
TBLPROPERTIES (
  'hoodie.datasource.hive_sync.enable'='true',
  'hoodie.datasource.hive_sync.table.strategy'='ro',
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.3.1',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='xx',
  'transient_lastDdlTime'='1674108302',
  'type'='mor') {code}
*The table looks like a realtime table.*

 

When we finish writing data and synchronize the ro table, the table already 
exists, so SERDEPROPERTIES and OUTPUTFORMAT are not modified.

As a result, the table type does not match what is expected.
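
Until the sync updates an existing table, one possible manual workaround is to realign the table definition by hand in Hive. The statements below are only a sketch, not the fix proposed in this issue. They assume that a read-optimized table should use the org.apache.hudi.hadoop.HoodieParquetInputFormat input format and carry the hoodie.query.as.ro.table serde property; both assumptions should be checked against the Hudi and Hive versions in use.
{code:sql}
-- Assumption: the table is visible in Hive as test.hudi_5584.
-- Assumption: the read-optimized view uses HoodieParquetInputFormat
-- instead of HoodieParquetRealtimeInputFormat.
ALTER TABLE test.hudi_5584 SET FILEFORMAT
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';

-- Assumption: this serde property marks the table as read-optimized for Spark.
ALTER TABLE test.hudi_5584 SET SERDEPROPERTIES ('hoodie.query.as.ro.table' = 'true');

-- Check the resulting definition.
SHOW CREATE TABLE test.hudi_5584;
{code}
This only patches a single table after the fact; the issue itself asks for the sync to apply such updates automatically when the target table already exists.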

 

 


> When the table to be synchronized already exists in hive, need to update 
> serde/table properties
> -----------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5584
>                 URL: https://issues.apache.org/jira/browse/HUDI-5584
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: HunterXHunter
>            Priority: Major
>
> When we set hoodie.datasource.hive_sync.table.strategy='ro', we expect only 
> one table to be synchronized to Hive, without the _ro suffix.
> However, sometimes the table has already been created in Hive beforehand,
> for example:
> {code:java}
> create table hive.test.HUDI_5584 (
>   id int,
>   ts int)
> using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.datasource.hive_sync.enable = 'true',
>   hoodie.datasource.hive_sync.table.strategy = 'ro'
> ) location '/tmp/HUDI_5584'
> {code}
> Running SHOW CREATE TABLE on it then returns:
> {code:java}
> CREATE EXTERNAL TABLE `hudi_5584`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` int,
>   `ts` int)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='file:///tmp/HUDI_5584')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/HUDI_5584'
> TBLPROPERTIES (
>   'hoodie.datasource.hive_sync.enable'='true',
>   'hoodie.datasource.hive_sync.table.strategy'='ro',
>   'preCombineField'='ts',
>   'primaryKey'='id',
>   'spark.sql.create.version'='3.3.1',
>   'spark.sql.sources.provider'='hudi',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='xx',
>   'transient_lastDdlTime'='1674108302',
>   'type'='mor') {code}
> *The table looks like a realtime table.*
>  
> When we finish writing data and synchronize the ro table, the table already 
> exists, so SERDEPROPERTIES and OUTPUTFORMAT are not modified.
> As a result, the table type does not match what is expected.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
