geserdugarov opened a new issue, #12133:
URL: https://github.com/apache/hudi/issues/12133
I've already created issue HUDI-8394, but I want to highlight this problem here as well.
I believe this is a critical issue with the current master when all of the following hold:
- bulk insert operation,
- `hoodie.datasource.write.row.writer.enable = false`,
- simple bucket index.
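For reference, the three conditions above map onto a handful of Hudi write options. A minimal sketch of that combination as plain config key/value pairs (the option keys are standard Hudi config names; wiring them into an actual `DataFrameWriter` via `df.write.format("hudi").options(...)` is assumed, not shown):

```scala
// Hedged sketch: the write-option combination described above.
// Plain key/value pairs only; no Spark session is required to inspect them.
object ReproOptions {
  val opts: Map[String, String] = Map(
    "hoodie.datasource.write.operation"         -> "bulk_insert",     // bulk insert operation
    "hoodie.datasource.write.row.writer.enable" -> "false",           // row writer disabled
    "hoodie.index.type"                         -> "BUCKET",          // bucket index...
    "hoodie.index.bucket.engine"                -> "SIMPLE",          // ...with the simple engine
    "hoodie.table.type"                         -> "COPY_ON_WRITE"    // on a COW table
  )
}
```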
**Describe the problem you faced**
When I bulk insert into a COW table, I see both parquet and log files in the file system, which is MOR table behavior.
``` Bash
cat ./.hoodie/hoodie.properties
# ...
# hoodie.table.type=COPY_ON_WRITE <-- COW table
# ...
```
``` Bash
ll ./dt\=2021-01-05/
# total 456
# drwxr-xr-x 2 d00838679 d00838679   4096 Oct 19 15:33 ./
# drwxrwxr-x 4 d00838679 d00838679   4096 Oct 19 15:32 ../
# -rw-r--r-- 1 d00838679 d00838679 435346 Oct 19 15:32 00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet  <-- base file
# -rw-r--r-- 1 d00838679 d00838679   3412 Oct 19 15:32 .00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet.crc
# -rw-r--r-- 1 d00838679 d00838679    978 Oct 19 15:33 .00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31  <-- log file, as for a MOR table
# -rw-r--r-- 1 d00838679 d00838679     16 Oct 19 15:33 ..00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31.crc
# -rw-r--r-- 1 d00838679 d00838679     96 Oct 19 15:32 .hoodie_partition_metadata
# -rw-r--r-- 1 d00838679 d00838679     12 Oct 19 15:32 ..hoodie_partition_metadata.crc
```
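The listing above can be checked mechanically: Hudi base files end in `.parquet`, while log files carry a `.log.` segment in their name (and `.crc` sidecars belong to neither). A small sketch of that classification (the object and method names here are my own, not Hudi API):

```scala
// Hedged sketch: distinguish Hudi base files from log files by file name alone,
// using the naming conventions visible in the listing above.
object HudiFileKind {
  def kindOf(name: String): String =
    if (name.endsWith(".crc"))          "other" // checksum sidecar file
    else if (name.endsWith(".parquet")) "base"  // columnar base file
    else if (name.contains(".log."))    "log"   // MOR-style delta log
    else                                "other" // partition metadata, etc.
}
```

For a COW table, every data file in the partition should classify as `base`; the presence of a `log` file is the reported anomaly.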
**To Reproduce**
To reproduce, the existing test `Test Bulk Insert Into Bucket Index Table` can be used:
``` Scala
test("Test Bulk Insert Into Bucket Index Table") {
  withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert",
    "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
    withTempDir { tmp =>
      val tableName = generateTableName
      // Create a partitioned table
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  dt string,
           |  name string,
           |  price double,
           |  ts long
           |) using hudi
           | tblproperties (
           |  primaryKey = 'id,name',
           |  type = 'cow',
           |  preCombineField = 'ts',
           |  hoodie.index.type = 'BUCKET',
           |  hoodie.index.bucket.engine = 'SIMPLE',
           |  hoodie.bucket.index.num.buckets = '2',
           |  hoodie.bucket.index.hash.field = 'id,name',
           |  hoodie.datasource.write.row.writer.enable = 'false')
           | partitioned by (dt)
           | location '${tmp.getCanonicalPath}'
         """.stripMargin)
      spark.sql(
        s"""
           | insert into $tableName values
           | (5, 'a1,1', 10, 1000, "2021-01-05")
         """.stripMargin)
      spark.sql(
        s"""
           | insert into $tableName values
           | (9, 'a3,3', 30, 3000, "2021-01-05")
         """.stripMargin)
    }
  }
}
```
**Expected behavior**
For a COW table, only parquet base files should be created; no log files should appear.
**Environment Description**
* Hudi version : current master
* Spark version : 3.5