[
https://issues.apache.org/jira/browse/HUDI-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen closed HUDI-8394.
----------------------------
Resolution: Fixed
Fixed via master branch: 63e014a6c0fd783818f380a6a84be58990abdecc
> MOR table behavior for Spark bulk insert to COW table with bucket index
> -----------------------------------------------------------------------
>
> Key: HUDI-8394
> URL: https://issues.apache.org/jira/browse/HUDI-8394
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Geser Dugarov
> Assignee: Geser Dugarov
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.0.1
>
>
> After a bulk insert into a COW table with a bucket index, we sometimes end up
> with both parquet and log files, as if it were a MOR table.
> Prerequisites:
> * COW table,
> * bulk insert operation,
> * hoodie.datasource.write.row.writer.enable = false,
> * simple bucket index,
> * inserting 2 rows whose key fields hash to the same bucket.
>
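> The last prerequisite can be illustrated with a minimal sketch. This is a deliberate
> simplification (the hash function and the comma-joined key encoding here are assumptions,
> not Hudi's actual BucketIdentifier logic): a simple bucket index maps each record to
> hash(key fields) mod number-of-buckets, so two records with different keys can land in
> the same bucket and therefore in the same file group.

```java
// Hypothetical sketch of bucket assignment for a simple bucket index:
// hash the configured key fields and take the result modulo the bucket
// count. This mirrors the idea only, not Hudi's exact BucketIdentifier code.
public class BucketIdSketch {
    static int bucketId(String hashFieldValues, int numBuckets) {
        // Mask the sign bit so the modulo result is always non-negative.
        return (hashFieldValues.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 2; // matches hoodie.bucket.index.num.buckets in the test
        // Key fields (id, name) of the two inserted rows, joined with a comma.
        System.out.println(bucketId("5,a1,1", numBuckets));
        System.out.println(bucketId("9,a3,3", numBuckets));
    }
}
```

> When both calls print the same bucket id, the second insert targets a file group
> that already contains a base file, which is the collision scenario this issue needs.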
> The issue can be reproduced with this test:
> {code:java}
> test("Test Bulk Insert Into Bucket Index Table") {
>   withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert",
>     "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
>     Seq("cow").foreach { tableType =>
>       Seq("false").foreach { bulkInsertAsRow =>
>         withTempDir { tmp =>
>           val tableName = generateTableName
>           // Create a partitioned table
>           spark.sql(
>             s"""
>                |create table $tableName (
>                |  id int,
>                |  dt string,
>                |  name string,
>                |  price double,
>                |  ts long
>                |) using hudi
>                | tblproperties (
>                |  primaryKey = 'id,name',
>                |  type = '$tableType',
>                |  preCombineField = 'ts',
>                |  hoodie.index.type = 'BUCKET',
>                |  hoodie.index.bucket.engine = 'SIMPLE',
>                |  hoodie.bucket.index.num.buckets = '2',
>                |  hoodie.bucket.index.hash.field = 'id,name',
>                |  hoodie.datasource.write.row.writer.enable = '$bulkInsertAsRow')
>                | partitioned by (dt)
>                | location '${tmp.getCanonicalPath}'
>              """.stripMargin)
>           // Note: do not write the field aliases; the partition field must be placed last.
>           spark.sql(
>             s"""
>                | insert into $tableName values
>                | (5, 'a1,1', 10, 1000, "2021-01-05")
>              """.stripMargin)
>           spark.sql(
>             s"""
>                | insert into $tableName values
>                | (9, 'a3,3', 30, 3000, "2021-01-05")
>              """.stripMargin)
>         }
>       }
>     }
>   }
> } {code}
>
> In the result, we can check the table properties:
> {code:java}
> cat ./.hoodie/hoodie.properties
> # ...
> # hoodie.table.type=COPY_ON_WRITE <-- COW table
> # ...{code}
> But the list of data files is as follows:
> {code:java}
> ll ./dt\=2021-01-05/
> # total 456
> # drwxr-xr-x 2 d00838679 d00838679   4096 Oct 19 15:33 ./
> # drwxrwxr-x 4 d00838679 d00838679   4096 Oct 19 15:32 ../
> # -rw-r--r-- 1 d00838679 d00838679 435346 Oct 19 15:32 00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet  <-- base file
> # -rw-r--r-- 1 d00838679 d00838679   3412 Oct 19 15:32 .00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet.crc
> # -rw-r--r-- 1 d00838679 d00838679    978 Oct 19 15:33 .00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31  <-- log file as for MOR table
> # -rw-r--r-- 1 d00838679 d00838679     16 Oct 19 15:33 ..00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31.crc
> # -rw-r--r-- 1 d00838679 d00838679     96 Oct 19 15:32 .hoodie_partition_metadata
> # -rw-r--r-- 1 d00838679 d00838679     12 Oct 19 15:32 ..hoodie_partition_metadata.crc {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)