[
https://issues.apache.org/jira/browse/HUDI-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen closed HUDI-8394.
----------------------------
Resolution: Fixed
Fixed via master branch: 63e014a6c0fd783818f380a6a84be58990abdecc
> MOR table behavior for Spark bulk insert to COW table with bucket index
> -----------------------------------------------------------------------
>
> Key: HUDI-8394
> URL: https://issues.apache.org/jira/browse/HUDI-8394
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Geser Dugarov
> Assignee: Geser Dugarov
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.0.1
>
>
> After a bulk insert into a COW table with a bucket index, we sometimes end up
> with both parquet and log files, as if it were a MOR table.
> Prerequisites:
> * COW table,
> * bulk insert operation,
> * hoodie.datasource.write.row.writer.enable = false,
> * simple bucket index,
> * inserting 2 rows whose key fields hash to the same bucket.
>
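> The last prerequisite can be illustrated with a minimal sketch. This is a deliberate
> simplification (the hash function and the comma-joined key encoding here are assumptions,
> not Hudi's actual BucketIdentifier logic): a simple bucket index maps each record to
> hash(key fields) mod number-of-buckets, so two records with different keys can land in
> the same bucket and therefore in the same file group.

```java
// Hypothetical sketch of bucket assignment for a simple bucket index:
// hash the configured key fields and take the result modulo the bucket
// count. This mirrors the idea only, not Hudi's exact BucketIdentifier code.
public class BucketIdSketch {
    static int bucketId(String hashFieldValues, int numBuckets) {
        // Mask the sign bit so the modulo result is always non-negative.
        return (hashFieldValues.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 2; // matches hoodie.bucket.index.num.buckets in the test
        // Key fields (id, name) of the two inserted rows, joined with a comma.
        System.out.println(bucketId("5,a1,1", numBuckets));
        System.out.println(bucketId("9,a3,3", numBuckets));
    }
}
```

> When both calls print the same bucket id, the second insert targets a file group
> that already contains a base file, which is the collision scenario this issue needs.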
> The issue can be reproduced with this test:
> {code:java}
> test("Test Bulk Insert Into Bucket Index Table") {
>   withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert",
>     "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
>     Seq("cow").foreach { tableType =>
>       Seq("false").foreach { bulkInsertAsRow =>
>         withTempDir { tmp =>
>           val tableName = generateTableName
>           // Create a partitioned table
>           spark.sql(
>             s"""
>                |create table $tableName (
>                |  id int,
>                |  dt string,
>                |  name string,
>                |  price double,
>                |  ts long
>                |) using hudi
>                | tblproperties (
>                |  primaryKey = 'id,name',
>                |  type = '$tableType',
>                |  preCombineField = 'ts',
>                |  hoodie.index.type = 'BUCKET',
>                |  hoodie.index.bucket.engine = 'SIMPLE',
>                |  hoodie.bucket.index.num.buckets = '2',
>                |  hoodie.bucket.index.hash.field = 'id,name',
>                |  hoodie.datasource.write.row.writer.enable = '$bulkInsertAsRow')
>                | partitioned by (dt)
>                | location '${tmp.getCanonicalPath}'
>              """.stripMargin)
>           // Note: do not write the field aliases; the partition field must be placed last.
>           spark.sql(
>             s"""
>                | insert into $tableName values
>                | (5, 'a1,1', 10, 1000, "2021-01-05")
>              """.stripMargin)
>           spark.sql(
>             s"""
>                | insert into $tableName values
>                | (9, 'a3,3', 30, 3000, "2021-01-05")
>              """.stripMargin)
>         }
>       }
>     }
>   }
> } {code}
>
> In the result, we can check the table properties:
> {code:java}
> cat ./.hoodie/hoodie.properties
> # ...
> # hoodie.table.type=COPY_ON_WRITE <-- COW table
> # ...{code}
> But the list of data files is as follows:
> {code:java}
> ll ./dt\=2021-01-05/
> # total 456
> # drwxr-xr-x 2 d00838679 d00838679   4096 Oct 19 15:33 ./
> # drwxrwxr-x 4 d00838679 d00838679   4096 Oct 19 15:32 ../
> # -rw-r--r-- 1 d00838679 d00838679 435346 Oct 19 15:32 00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet  <-- base file
> # -rw-r--r-- 1 d00838679 d00838679   3412 Oct 19 15:32 .00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet.crc
> # -rw-r--r-- 1 d00838679 d00838679    978 Oct 19 15:33 .00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31  <-- log file as for MOR table
> # -rw-r--r-- 1 d00838679 d00838679     16 Oct 19 15:33 ..00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31.crc
> # -rw-r--r-- 1 d00838679 d00838679     96 Oct 19 15:32 .hoodie_partition_metadata
> # -rw-r--r-- 1 d00838679 d00838679     12 Oct 19 15:32 ..hoodie_partition_metadata.crc {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)