[jira] [Updated] (HUDI-5584) When the table to be synchronized already exists in hive, need to update serde/table properties

HunterXHunter (Jira) Wed, 18 Jan 2023 22:21:08 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


HunterXHunter updated HUDI-5584:
--------------------------------
    Description: 
When hoodie.datasource.hive_sync.table.strategy='ro' is set, we expect only one 
table to be synchronized to Hive, without the _ro suffix.

But sometimes the table has already been created in Hive beforehand, for example:
{code:sql}
create table hive.test.HUDI_5584 (
  id int,
  ts int)
using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.hive_sync.enable = 'true',
  hoodie.datasource.hive_sync.table.strategy = 'ro'
) location '/tmp/HUDI_5584'
{code}
Running show create table on this table then returns:
{code:sql}
CREATE EXTERNAL TABLE `hudi_5584`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `ts` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='file:///tmp/HUDI_5584')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/HUDI_5584'
TBLPROPERTIES (
  'hoodie.datasource.hive_sync.enable'='true',
  'hoodie.datasource.hive_sync.table.strategy'='ro',
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.3.1',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='xx',
  'transient_lastDdlTime'='1674108302',
  'type'='mor') {code}
*The table is registered like a realtime table (note the HoodieParquetRealtimeInputFormat input format).*

 

When we finish writing data and synchronize the ro table, SERDEPROPERTIES and 
OUTPUTFORMAT are not modified, because the table already exists.

As a result, the table type does not match what we expect.
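
The check this issue asks for can be sketched roughly as below. This is only an illustration under stated assumptions, not Hudi's actual sync code: the class ExistingTableSyncCheck, its methods, and the "inputFormat" key are hypothetical names, and the sketch compares only the input format, since that is the setting in the DDL above that marks the table as realtime. It also assumes that a read-optimized table is registered with org.apache.hudi.hadoop.HoodieParquetInputFormat.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: this class and its methods are illustrative, not Hudi APIs.
class ExistingTableSyncCheck {

    // Input format shown in the SHOW CREATE TABLE output of this issue.
    static final String REALTIME_INPUT_FORMAT =
            "org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat";

    // Assumed input format for a read-optimized table.
    static final String READ_OPTIMIZED_INPUT_FORMAT =
            "org.apache.hudi.hadoop.HoodieParquetInputFormat";

    // Maps a single-table strategy ("ro" or "rt") to the input format it implies.
    static String expectedInputFormat(String strategy) {
        switch (strategy.toLowerCase(Locale.ROOT)) {
            case "ro":
                return READ_OPTIMIZED_INPUT_FORMAT;
            case "rt":
                return REALTIME_INPUT_FORMAT;
            default:
                throw new IllegalArgumentException("unsupported strategy: " + strategy);
        }
    }

    // Returns the settings that would have to change on an existing table.
    // An empty map means the table already matches the strategy.
    static Map<String, String> pendingUpdates(String strategy, String currentInputFormat) {
        Map<String, String> updates = new LinkedHashMap<>();
        String expected = expectedInputFormat(strategy);
        if (!expected.equals(currentInputFormat)) {
            updates.put("inputFormat", expected);
        }
        return updates;
    }

    public static void main(String[] args) {
        // The pre-created table from this issue: strategy 'ro', realtime input format.
        System.out.println(pendingUpdates("ro", REALTIME_INPUT_FORMAT));
    }
}
```

For the table in this issue the returned map is not empty, so a sync that only checks whether the table exists would leave it registered as realtime.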

 

 

  was:
When hoodie.datasource.hive_sync.table.strategy='ro' is set, we expect only one 
table to be synchronized to Hive, without the _ro suffix.

But sometimes the table may already have been created in Hive beforehand, for example:
{code:sql}
create table hive.test.HUDI_5584 (
  id int,
  ts int)
using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.hive_sync.enable = 'true',
  hoodie.datasource.hive_sync.table.strategy = 'ro'
) location '/tmp/HUDI_5584'
{code}
Running show create table on this table then returns:
{code:sql}
CREATE EXTERNAL TABLE `hudi_5584`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `ts` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='file:///tmp/HUDI_5584')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/HUDI_5584'
TBLPROPERTIES (
  'hoodie.datasource.hive_sync.enable'='true',
  'hoodie.datasource.hive_sync.table.strategy'='ro',
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.3.1',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='xx',
  'transient_lastDdlTime'='1674108302',
  'type'='mor') {code}
The table is registered like a realtime table.

When we finish writing data and synchronize tables, SERDEPROPERTIES and 
OUTPUTFORMAT are not modified, because the table already exists.

As a result, the table type is not what we expect.

 

 


> When the table to be synchronized already exists in hive, need to update 
> serde/table properties
> -----------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5584
>                 URL: https://issues.apache.org/jira/browse/HUDI-5584
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: HunterXHunter
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
