[
https://issues.apache.org/jira/browse/HUDI-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-6144:
---------------------------------
Fix Version/s: (was: 1.0.0)
> [Spark][Flink]bucket index and then insert data in bulk, the correct file
> cannot be created
> -------------------------------------------------------------------------------------------
>
> Key: HUDI-6144
> URL: https://issues.apache.org/jira/browse/HUDI-6144
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink-sql, spark-sql
> Affects Versions: 0.14.0
> Reporter: lizhiqiang
> Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: image-2023-04-27-14-49-12-731.png
>
>
> When I use the bucket index and then bulk insert data, the data files are not
> created correctly: the file name prefix is not replaced with the bucket ID.
> I have an idea for a fix:
> 1. When the table is created, create all the files up front, so that the
> number of files equals the number of buckets, and replace the prefix of each
> file name with its bucket ID.
> 2. Build a hash table in memory whose keys are bucket IDs, each mapped to the
> path of that bucket's file. Values are cached in the hash table first, and
> when the configured threshold is reached, the values for a key are flushed to
> its mapped file.
> 3. The values held under each key of the hash table can be sorted in memory
> before they are flushed.
> Steps to reproduce:
> 1. Create a table and insert data:
>
> {code:java}
> create table xxx.B (
>   id int,
>   name string,
>   price double,
>   ts long,
>   dt string
> ) using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.index.type = 'BUCKET',
>   hoodie.bucket.index.num.buckets = '4'
> );
>
> insert into xxx.B values (5, 'a', 35, 1000, '2021-01-05');{code}
> 2. Create a table and insert data in the same statement (create table ... as
> select); the default operation is bulk insert:
> {code:java}
> -- create table and insert some data.
> create table xxx.A using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.index.type = 'BUCKET',
>   hoodie.sql.bulk.insert.enable = 'false',
>   hoodie.datasource.write.operation = 'upsert',
>   hoodie.bucket.index.num.buckets = '4'
> ) as select id, name, price, ts, dt from xxx.B;{code}
> -- default is bulk insert.
> 3. The file name prefix is not replaced with the bucket ID:
> !image-2023-04-27-14-49-12-731.png!
>
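The buffering idea in the description (points 1 to 3 of the proposal) could be sketched roughly as follows. This is a minimal illustration in Python, not code from Hudi: the names `bucket_id`, `BucketedBuffer`, `flush_threshold` and `sink` are hypothetical, the hash function is only a stand-in for whatever the bucket index actually uses, and the 8-digit zero-padded prefix is an assumption about how the bucket ID appears in the file name.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, dict]  # (record key, payload); hypothetical record shape


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a deterministic, non-negative hash.

    Assumption: a simple polynomial hash stands in for the real one.
    """
    h = 0
    for ch in record_key:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets


class BucketedBuffer:
    """Hash table keyed by bucket ID that buffers records before flushing."""

    def __init__(self, num_buckets: int, flush_threshold: int,
                 sink: Callable[[str, List[Record]], None]) -> None:
        self.num_buckets = num_buckets
        self.flush_threshold = flush_threshold
        self.sink = sink  # receives (file name prefix, sorted records)
        self.buffers: Dict[int, List[Record]] = defaultdict(list)

    @staticmethod
    def file_prefix(bucket: int) -> str:
        # Assumption: the file name starts with the zero-padded bucket ID.
        return f"{bucket:08d}"

    def write(self, key: str, payload: dict) -> None:
        bucket = bucket_id(key, self.num_buckets)
        self.buffers[bucket].append((key, payload))
        # Flush only the bucket that reached the configured threshold.
        if len(self.buffers[bucket]) >= self.flush_threshold:
            self.flush(bucket)

    def flush(self, bucket: int) -> None:
        records = self.buffers.pop(bucket, [])
        if records:
            # Sort in memory by record key before handing off to the file.
            records.sort(key=lambda r: r[0])
            self.sink(self.file_prefix(bucket), records)

    def close(self) -> None:
        # Flush whatever is still buffered, in bucket order.
        for bucket in sorted(self.buffers):
            self.flush(bucket)
```

With `num_buckets=4`, every flushed batch would carry one of the prefixes `00000000` to `00000003`, which is the behaviour the report says is missing on the bulk insert path.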
--
This message was sent by Atlassian Jira
(v8.20.10#820010)