silly-carbon opened a new issue, #11875:
URL: https://github.com/apache/hudi/issues/11875

   **Describe the problem you faced**
   
   Spark config:
   
   ```
   spark.driver.cores=1
   spark.driver.memory=18g
   spark.executor.cores=10
   spark.executor.memory=32g
   spark.driver.maxResultSize=8g
   spark.default.parallelism=400
   spark.sql.shuffle.partitions=400
   spark.dynamicAllocation.maxExecutors=20
   spark.executor.memoryOverhead=3g
   spark.kryoserializer.buffer.max=1024m
   ```
   
   But Hudi spends a lot of time in HoodieBloomIndex.tagLocation:
   
   
![image](https://github.com/user-attachments/assets/602a8827-426a-42fb-90fe-d6ee6152524f)
   
   
   And it hits GC issues:
   
![image](https://github.com/user-attachments/assets/4bfa0a0b-1815-4c09-9b7a-d66d01183b83)
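   
   For reference, a minimal tuning sketch for the tagLocation phase (the config names below are standard Hudi 0.14.x bloom-index options, but the values are illustrative assumptions, not settings from this run):
   
   ```sql
   -- Illustrative only: knobs that commonly drive HoodieBloomIndex.tagLocation cost.
   SET hoodie.bloom.index.parallelism = 400;        -- assumed value; explicit parallelism for the index-lookup shuffle
   SET hoodie.bloom.index.prune.by.ranges = true;   -- default; skip files by min/max key range before bloom checks
   SET hoodie.bloom.index.filter.type = DYNAMIC_V0; -- write-time option: auto-sized bloom filters for newly written files
   ```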
   
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a table:
   
   ```sql
   CREATE TABLE `temp_db`.`xxxxxxxxxxx` (
     `_hoodie_is_deleted` BOOLEAN,
     `t_pre_combine_field` BIGINT,
     `order_type` INT,
     `order_no` INT,
     `profile_no` INT,
     `profile_type` STRING,
     `profile_cat` STRING,
     `u_version` STRING,
     `order_line_no` INT,
     `profile_c` STRING,
     `profile_i` INT,
     `profile_f` DECIMAL(20,8),
     `profile_d` TIMESTAMP,
     `active` STRING,
     `entry_datetime` TIMESTAMP,
     `entry_id` INT,
     `h_version` INT)
   USING hudi
   CLUSTERED BY (order_no, profile_type, profile_no, order_type, profile_cat)
   INTO 2 BUCKETS
   TBLPROPERTIES (
     'primaryKey' = 'order_no,profile_type,profile_no,order_type,profile_cat',
     'hoodie.cleaner.policy.failed.writes' = 'LAZY',
     'type' = 'cow',
     'hoodie.write.lock.filesystem.expire' = '15',
     'preCombineField' = 't_pre_combine_field',
     'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',
     'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',
     'hoodie.index.type' = 'BLOOM'
   );
   ```
   2. BULK_INSERT 1.6 billion rows:
   
   ```sql
   SET spark.sql.parquet.datetimeRebaseModeInWrite = CORRECTED;
   SET hoodie.datasource.write.operation = bulk_insert;
   SET hoodie.combine.before.insert = false;
   
   INSERT OVERWRITE temp_db.xxxxxxxxxxxxx
   SELECT FALSE, 1, * FROM ods_us.xxxxxx_source;
   ```
   
   3. Insert 1 million rows; this upsert is the slow step (see the sketch after these steps):
   
   ```sql
   INSERT INTO temp_db.xxxxxxxxx
   (
     SELECT TRUE AS _hoodie_is_deleted, *   -- 0 rows
     FROM ods_us.xxxxxxxx_dddd
     UNION ALL
     SELECT FALSE AS _hoodie_is_deleted, *  -- 1 million rows
     FROM ods_us.xxxxxxxxx_stage
   );
   ```
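   
   A possible mitigation, sketched here as an untested assumption (these are standard Hudi 0.14.x metadata-table configs, but whether they help this workload is unverified):
   
   ```sql
   -- Assumption: serving bloom filters and column stats from the metadata table
   -- lets tagLocation avoid opening every parquet footer on HDFS.
   SET hoodie.metadata.enable = true;
   SET hoodie.metadata.index.bloom.filter.enable = true;
   SET hoodie.metadata.index.column.stats.enable = true;
   SET hoodie.bloom.index.use.metadata = true;
   ```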
   
   **Expected behavior**
   
   The UPSERT completes quickly.
   
   **Environment Description**
   
   * Hudi version : 0.14.1 (hudi-spark3.2-bundle_2.12-0.14.1.jar)
   
   * Spark version : 3.2
   
   * Hive version : 3.0
   
   * Hadoop version : 3.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   
   

