[jira] [Updated] (HUDI-5155) hive reading rt table will get duplicate record

wangwenli (Jira) Thu, 03 Nov 2022 00:22:06 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


wangwenli updated HUDI-5155:
----------------------------
    Description: 
When Hive reads a MOR (merge-on-read) RT table, it returns duplicate records 
in the following case:
 # Use the bucket index type.
 # Use primary keys 1 - 100 and set the bucket number to 1.
 # Insert records 1 - 100 and compact them; one parquet file is generated.
 # Insert records 1 - 100 once again, but don't compact them, so the data 
files consist of 1 parquet file + 1 log file.
 # Run select * from table where key=1; you get 2 records.

The cause is:

  In HoodieMergeOnReadTableInputFormat, isSplitable returns true, so two map 
tasks are generated. Each task includes the log file, so each task returns 
one record.

Please refer to:

https://github.com/apache/hudi/issues/4618

  was:
When Hive reads a MOR (merge-on-read) RT table, it returns duplicate records 
in the following case:
 # Use the bucket index type.
 # Use primary keys 1 - 100 and set the bucket number to 1.
 # Insert records 1 - 100 and compact them; one parquet file is generated.
 # Insert records 1 - 100 once again, but don't compact them, so the data 
files consist of 1 parquet file + 1 log file.
 # Run select * from table where key=1; you get 2 records.

The cause is:

  In HoodieMergeOnReadTableInputFormat, isSplitable returns true, so two map 
tasks are generated. Each task includes the log file, so each task returns 
one record.


> hive reading rt table will get duplicate record
> -----------------------------------------------
>
>                 Key: HUDI-5155
>                 URL: https://issues.apache.org/jira/browse/HUDI-5155
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: hive
>    Affects Versions: 0.11.0
>            Reporter: wangwenli
>            Priority: Major
>
> When Hive reads a MOR (merge-on-read) RT table, it returns duplicate records 
> in the following case:
>  # Use the bucket index type.
>  # Use primary keys 1 - 100 and set the bucket number to 1.
>  # Insert records 1 - 100 and compact them; one parquet file is generated.
>  # Insert records 1 - 100 once again, but don't compact them, so the data 
> files consist of 1 parquet file + 1 log file.
>  # Run select * from table where key=1; you get 2 records.
> The cause is:
>   In HoodieMergeOnReadTableInputFormat, isSplitable returns true, so two map 
> tasks are generated. Each task includes the log file, so each task returns 
> one record.
> Please refer to:
> https://github.com/apache/hudi/issues/4618
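
The cause described above can be illustrated with a small toy model. This is 
only a sketch and not Hudi code: the names read_split and scan are 
hypothetical, the base parquet file and the log file are modelled as plain 
dicts, and it assumes that a task emits every log record whose key is missing 
from its own slice of the base file as a new row. Under those assumptions, 
splitting the base file in two while handing the whole log file to both tasks 
returns key 1 twice, whereas reading the base file as a single split returns 
it once.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (primary key, payload)


def read_split(base_slice: Dict[int, str], log: Dict[int, str]) -> List[Record]:
    """Toy model of one map task (hypothetical, not Hudi's reader).

    The task merges its slice of the base file with the FULL log file.
    Log records override base records with the same key; log records whose
    key is not in this slice are emitted as new rows (assumption).
    """
    merged = dict(base_slice)
    merged.update(log)
    return sorted(merged.items())


def scan(base: Dict[int, str], log: Dict[int, str], num_splits: int) -> List[Record]:
    """Cut the base file into num_splits slices and run one task per slice."""
    keys = sorted(base)
    chunk = -(-len(keys) // num_splits)  # ceiling division
    rows: List[Record] = []
    for i in range(num_splits):
        slice_keys = keys[i * chunk:(i + 1) * chunk]
        base_slice = {k: base[k] for k in slice_keys}
        rows.extend(read_split(base_slice, log))
    return rows


# First insert of keys 1 - 100, compacted into one base (parquet) file.
base = {k: "v1" for k in range(1, 101)}
# Second insert of keys 1 - 100, not compacted, so it lives in one log file.
log = {k: "v2" for k in range(1, 101)}

split_rows = scan(base, log, num_splits=2)  # models isSplitable == true
whole_rows = scan(base, log, num_splits=1)  # models isSplitable == false

dup = [r for r in split_rows if r[0] == 1]
single = [r for r in whole_rows if r[0] == 1]
print(len(dup), len(single))
```

In this model the duplicate goes away as soon as the base file is read as one 
split, which matches the description's pointer to isSplitable as the cause. 
Whether that is the right fix in Hudi itself is left to the linked issue.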



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
