SamarthRaval opened a new issue, #10468:
URL: https://github.com/apache/hudi/issues/10468

   I am running multi-writer jobs against one Hudi table.
   
   I am submitting roughly 60-100 files in parallel to write to the Hudi table, with all the multi-writer concurrency settings configured.
   It is possible that many jobs are trying to update records in the same file-id.
   
   Only a few jobs succeed; the rest keep retrying but never complete.
   
   I see a few rollbacks in between.
   I can also see many commit.requested instants appearing on the timeline, but they never finish.
   
   Steps to reproduce the behavior:
   
   1. Submit roughly 100 files in parallel.
   2. Have all of them write to the same table.
   
   - I know I could use a queue or something similar to submit the jobs sequentially, but I really want to take advantage of concurrent job execution.
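One common way to keep the jobs concurrent while tolerating optimistic-concurrency conflicts is to retry a failed commit on the client side with exponential backoff and jitter, so parallel writers stop colliding on the same file groups. A minimal sketch (the conflict exception type, delays, and `write_fn` callable are assumptions for illustration, not Hudi's actual API):

```python
import random
import time


def write_with_retry(write_fn, max_retries=8, base_delay_s=1.0, max_delay_s=60.0):
    """Retry a write after a concurrent-modification conflict.

    Backs off exponentially with jitter between attempts so that many
    parallel writers do not keep retrying in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return write_fn()
        except RuntimeError:  # stand-in for Hudi's write-conflict exception
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

Hudi also exposes lock-retry tuning (e.g. `hoodie.write.lock.wait_time_ms` and `hoodie.write.lock.num_retries`) that may be worth raising when this many writers contend for one lock.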
   
   **Expected behavior**
   
   Hudi should be smart enough to handle roughly 100 concurrent writers and eventually finish all of them. Some delay is expected, but it should not take forever.
   
   **Environment Description**
   
   * EMR release : emr-6.15.0
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   Using the configuration below to write data into Hudi:
   
   hoodie.datasource.write.insert.drop.duplicates -> true,
   hoodie.datasource.hive_sync.database -> databaseName,
   hoodie.combine.before.insert -> true,
   hoodie.parquet.small.file.limit -> 104857600,
   hoodie.datasource.hive_sync.mode -> hms,
   hoodie.copyonwrite.record.size.estimate -> 50,
   hoodie.datasource.hive_sync.support_timestamp -> true,
   hoodie.datasource.write.precombine.field -> version,
   hoodie.datasource.hive_sync.partition_fields -> partition,
   hoodie.datasource.hive_sync.use_jdbc -> false,
   hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor,
   hoodie.meta.sync.metadata_file_listing -> true,
   hoodie.parquet.max.file.size -> 125829120,
   hoodie.cleaner.parallelism -> 1000,
   hoodie.metadata.enable -> true,
   hoodie.datasource.hive_sync.table -> tableName,
   hoodie.clean.automatic -> true,
   hoodie.datasource.write.operation -> insert,
   hoodie.datasource.hive_sync.enable -> true,
   hoodie.datasource.write.recordkey.field -> id,
   hoodie.table.name -> tableName,
   hoodie.write.lock.dynamodb.billing_mode -> PAY_PER_REQUEST,
   hoodie.datasource.write.table.type -> COPY_ON_WRITE,
   hoodie.datasource.write.hive_style_partitioning -> true,
   hoodie.write.lock.dynamodb.endpoint_url -> url,
   hoodie.write.lock.dynamodb.partition_key -> tableName,
   hoodie.bulkinsert.sort.mode -> GLOBAL_SORT,
   hoodie.datasource.hive_sync.auto_create_database -> true,
   hoodie.cleaner.policy -> KEEP_LATEST_BY_HOURS,
   hoodie.write.concurrency.early.conflict.detection.enable -> true,
   hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.SimpleKeyGenerator,
   hoodie.cleaner.policy.failed.writes -> LAZY,
   hoodie.cleaner.hours.retained -> 120,
   hoodie.write.lock.dynamodb.table -> lock-table-name,
   hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider,
   hoodie.datasource.write.partitionpath.field -> partition,
   hoodie.write.concurrency.mode -> OPTIMISTIC_CONCURRENCY_CONTROL,
   hoodie.write.lock.dynamodb.region -> us-east-1
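For context, options like these are typically passed to the Spark DataFrame writer roughly as below. This is a minimal sketch with a trimmed-down option set; `df` and `basePath` are placeholders, and it assumes a SparkSession with the Hudi bundle on the classpath, not the exact job code:

```python
# Sketch only: `df` is the input DataFrame, `basePath` the S3 table path.
hudi_options = {
    "hoodie.table.name": "tableName",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "version",
    "hoodie.datasource.write.partitionpath.field": "partition",
    # Multi-writer settings from the report:
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "lock-table-name",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save(basePath)
)
```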
   
   Timeline screenshots:
   
![image](https://github.com/apache/hudi/assets/8738019/f7bd7b72-5b15-4d84-9dea-cef6f19a672d)
   
![image](https://github.com/apache/hudi/assets/8738019/ea0d0614-3e6e-4910-ab75-aa2774aa4f3e)
   
   All jobs keep adding commit.requested instants but never finish, and the run takes forever. This is not normal: a single job generally takes only 3-5 minutes to complete.
   
   **Stacktrace**
   
   No stacktrace available; the writers hang rather than fail with an error.
   
   

