silly-carbon opened a new issue, #11875: URL: https://github.com/apache/hudi/issues/11875
**Describe the problem you faced**

Spark config:

```
spark.driver.cores=1
spark.driver.memory=18g
spark.executor.cores=10
spark.executor.memory=32g
spark.driver.maxResultSize=8g
spark.default.parallelism=400
spark.sql.shuffle.partitions=400
spark.dynamicAllocation.maxExecutors=20
spark.executor.memoryOverhead=3g
spark.kryoserializer.buffer.max=1024m
```

But Hudi spends a lot of time in `HoodieBloomIndex.tagLocation` (screenshot omitted), and the job also shows GC pressure (screenshot omitted).

**To Reproduce**

Steps to reproduce the behavior:

1. Create a table:

```sql
CREATE TABLE `temp_db`.`xxxxxxxxxxx` (
  `_hoodie_is_deleted` BOOLEAN,
  `t_pre_combine_field` BIGINT,
  `order_type` INT,
  `order_no` INT,
  `profile_no` INT,
  `profile_type` STRING,
  `profile_cat` STRING,
  `u_version` STRING,
  `order_line_no` INT,
  `profile_c` STRING,
  `profile_i` INT,
  `profile_f` DECIMAL(20,8),
  `profile_d` TIMESTAMP,
  `active` STRING,
  `entry_datetime` TIMESTAMP,
  `entry_id` INT,
  `h_version` INT)
USING hudi
CLUSTERED BY (order_no, profile_type, profile_no, order_type, profile_cat) INTO 2 BUCKETS
TBLPROPERTIES (
  'primaryKey' = 'order_no,profile_type,profile_no,order_type,profile_cat',
  'hoodie.cleaner.policy.failed.writes' = 'LAZY',
  'type' = 'cow',
  'hoodie.write.lock.filesystem.expire' = '15',
  'preCombineField' = 't_pre_combine_field',
  'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',
  'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',
  'hoodie.index.type' = 'BLOOM'
);
```

2. BULK_INSERT 1.6 billion rows:

```sql
SET spark.sql.parquet.datetimeRebaseModeInWrite = CORRECTED;
SET hoodie.datasource.write.operation = bulk_insert;
SET hoodie.combine.before.insert = false;

INSERT OVERWRITE temp_db.xxxxxxxxxxxxx
SELECT FALSE, 1, * FROM ods_us.xxxxxx_source;
```

3. Insert 1 million rows:

```sql
INSERT INTO temp_db.xxxxxxxxx (
  SELECT TRUE AS _hoodie_is_deleted, *   -- 0 rows
  FROM ods_us.xxxxxxxx_dddd
  UNION ALL
  SELECT FALSE AS _hoodie_is_deleted, *  -- 1 million rows
  FROM ods_us.xxxxxxxxx_stage
)
```

**Expected behavior**

The UPSERT completes quickly.
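As a possible mitigation sketch (not part of the original report, and the values below are illustrative assumptions, not tested against this table): the bloom-index lookup in Hudi exposes session-level knobs such as `hoodie.bloom.index.parallelism`, `hoodie.bloom.index.prune.by.ranges`, and `hoodie.bloom.index.bucketized.checking` that can affect `tagLocation` cost before the upsert runs.

```sql
-- Hypothetical tuning sketch before the INSERT in step 3
-- (option names exist in Hudi 0.14.x; values here are guesses, not recommendations):
SET hoodie.bloom.index.parallelism = 400;        -- override auto-computed lookup parallelism
SET hoodie.bloom.index.prune.by.ranges = true;   -- skip files whose key ranges cannot match
SET hoodie.bloom.index.bucketized.checking = true; -- spread bloom checks across more tasks
```

Whether these help depends on the key distribution; random (non-monotonic) keys tend to defeat range pruning, which is one common cause of slow bloom-index tagging on large tables.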
**Environment Description**

* Hudi version : 0.14.1 (hudi-spark3.2-bundle_2.12-0.14.1.jar)
* Spark version : 3.2
* Hive version : 3.0
* Hadoop version : 3.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Stacktrace**

(none provided)
