[jira] [Created] (HUDI-6144) [Spark][Flink]bucket index and then insert data in bulk, the correct file cannot be created

lizhiqiang (Jira) Wed, 26 Apr 2023 23:53:14 -0700

lizhiqiang created HUDI-6144:
--------------------------------

             Summary: [Spark][Flink]bucket index and then insert data in bulk, 
the correct file cannot be created
                 Key: HUDI-6144
                 URL: https://issues.apache.org/jira/browse/HUDI-6144
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: lizhiqiang
         Attachments: image-2023-04-27-14-49-12-731.png


When I use the bucket index and then insert data in bulk, the files are not 
created correctly: the file name prefix is not replaced with the bucket ID.
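
To make the expected naming concrete, here is a minimal sketch of what "the prefix of the file is replaced with the bucket ID" could look like. It assumes the bucket is chosen by hashing the record key modulo the bucket count, and that the bucket ID is written as an 8-digit zero-padded prefix of an otherwise random file ID. The function names and the CRC32 hash are hypothetical and only illustrate the idea; they are not Hudi's actual implementation.

```python
import uuid
import zlib

NUM_BUCKETS = 4  # same value as hoodie.bucket.index.num.buckets in this report


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket (hypothetical hash: CRC32 modulo bucket count)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def bucket_file_id(bucket: int) -> str:
    """Build a file ID whose first 8 characters are the zero-padded bucket ID.

    Assumption: the rest of the ID is a random UUID suffix.
    """
    random_id = str(uuid.uuid4())
    return f"{bucket:08d}{random_id[8:]}"


def bucket_of_file(file_id: str) -> int:
    """Recover the bucket ID from the 8-character prefix of a file ID."""
    return int(file_id[:8])


if __name__ == "__main__":
    b = bucket_id("5")
    fid = bucket_file_id(b)
    print(b, fid, bucket_of_file(fid))
```

With such a scheme, a reader can locate the file for a key by computing its bucket and matching the prefix. The bug reported here is that files written by bulk insert do not carry that prefix.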


I have an idea for a fix:
1. When the table is created, create all files up front, so that the number of 
files equals the number of buckets, and replace the prefix of each file with 
its bucket ID. 
2. Build a hash table in memory whose keys are bucket IDs, each mapped to the 
path of that bucket's file. Incoming values are cached in the hash table 
first, and when the configured threshold is reached, the values cached under a 
key are flushed to the file mapped to that key. 
3. The values cached under each key can be sorted in memory before the flush.
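
The buffering in steps 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: record keys are integers, the bucket is `key % num_buckets`, the threshold is a per-bucket record count, and `flush_fn` is a hypothetical callback that stands in for appending to the bucket's file. None of these names come from Hudi.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[int, dict]  # (record key, payload)


class BucketBuffer:
    """Caches records per bucket and flushes a bucket once it reaches a threshold."""

    def __init__(
        self,
        num_buckets: int,
        flush_threshold: int,
        flush_fn: Callable[[int, List[Record]], None],
    ) -> None:
        self.num_buckets = num_buckets
        self.flush_threshold = flush_threshold
        self.flush_fn = flush_fn  # hypothetical writer for one bucket's file
        # One in-memory list per bucket ID (the "hash table" of step 2).
        self.buffers: Dict[int, List[Record]] = {b: [] for b in range(num_buckets)}

    def add(self, key: int, payload: dict) -> None:
        bucket = key % self.num_buckets  # assumed hash for integer keys
        buf = self.buffers[bucket]
        buf.append((key, payload))
        if len(buf) >= self.flush_threshold:
            self._flush(bucket)

    def _flush(self, bucket: int) -> None:
        buf = self.buffers[bucket]
        if not buf:
            return
        buf.sort(key=lambda record: record[0])  # step 3: sort in memory first
        self.flush_fn(bucket, buf)
        self.buffers[bucket] = []  # start a fresh buffer for this bucket

    def close(self) -> None:
        """Flush whatever is still cached, bucket by bucket."""
        for bucket in range(self.num_buckets):
            self._flush(bucket)


if __name__ == "__main__":
    writer = BucketBuffer(4, 2, lambda b, rows: print(b, [k for k, _ in rows]))
    for k in (9, 5, 2):
        writer.add(k, {"name": "a"})
    writer.close()
```

Keeping one buffer per bucket means each flush touches exactly one file, which is what lets the file name stay tied to a single bucket ID.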

Steps to reproduce:

1. Create a table and insert data:

```sql
create table xxx.B (
id int,
name string,
price double,
ts long,
dt string
) using hudi
tblproperties (
type = 'mor',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.index.type = 'BUCKET',
hoodie.bucket.index.num.buckets = '4'
);
 
insert into xxx.B values (5, 'a', 35, 1000, '2021-01-05');
```
2. Create a second table and insert data in the same statement (CTAS); the 
default write operation is bulk insert:
```sql
-- create table and insert some data.
create table xxx.A using hudi
tblproperties (
type = 'mor',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.index.type = 'BUCKET',
hoodie.sql.bulk.insert.enable= 'false',
hoodie.datasource.write.operation = 'upsert',
hoodie.bucket.index.num.buckets = '4'
) as select id,name,price,ts,dt from xxx.B;
```
Note: the default operation for this statement is bulk insert.
3. The prefix of the created file is not replaced with the bucket ID:
!image-2023-04-27-14-49-12-731.png!
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
