haripriyarhp opened a new issue, #8217: URL: https://github.com/apache/hudi/issues/8217
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I have a Spark Structured Streaming job writing to a MoR table on S3. Based on the docs at https://hudi.apache.org/docs/next/compaction#spark-structured-streaming, I have set the properties for async compaction. My understanding is that compaction runs asynchronously so that ingestion can proceed in parallel. What I observe instead is that Spark assigns all the resources to the compaction job, and ingestion only continues once compaction is finished, even though both jobs are running in parallel. Is there an equivalent of `--delta-sync-scheduling-weight`, `--compact-scheduling-weight`, `--delta-sync-scheduling-minshare`, and `--compact-scheduling-minshare` for Spark Structured Streaming, so that ingestion and compaction can run in parallel with controlled resource allocation? (See the fair-scheduler sketch after the additional context below.)

**To Reproduce**

Steps to reproduce the behavior:

1.
2.
3.
4.

**Expected behavior**

I expect compaction and ingestion to happen in parallel.

**Environment Description**

* Hudi version : 0.13.0
* Spark version : 3.1.2
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no, on the Spark operator on Kubernetes

Here are screenshots of the micro-batches of the Spark Structured Streaming job. Normally each batch takes 4-6 minutes, but the batches immediately following a compaction take noticeably longer.

Time taken for a normal batch: batch 206 (screenshot).

Batch 207: the last job, 2961, is HoodieCompactionPlanGenerator, which adds 8.6 minutes of overhead to this batch (screenshot).

Batch 208: the batch starts with job 2963, but jobs 2964-2972 belong to compaction, and it continues with job 2973 only after the compaction jobs have finished, even though compaction runs on a different thread (see the screenshot below). The time taken for stages such as "Load base files", "Building workload profile" and "Getting small files" has drastically increased; compare with batch 206 for the normal timings (screenshot).

Compaction jobs 2964-2972 (screenshot).

Is there something that can be done to improve this performance?

**Additional context**

Using hudi-spark3.1-bundle_2.12-0.13.0.jar along with hadoop-aws_3.1.2.jar and aws-java-sdk-bundle_1.11.271.jar.
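For reference, my understanding is that the DeltaStreamer flags mentioned above are implemented on top of Spark's FAIR scheduler pools (a weight and minShare per pool). Below is a minimal sketch of what an equivalent fair-scheduler setup looks like at the plain Spark level; the pool names, the allocation file path, and the XML contents are illustrative assumptions rather than Hudi options, and whether the async compaction thread started by the Hudi streaming sink can be pointed at its own pool this way is exactly what I am asking about.

```scala
// Minimal sketch, assuming plain Spark FAIR scheduler pools (not a Hudi option).
// The allocation file referenced below would look roughly like:
//
//   <allocations>
//     <pool name="hoodiedeltasync">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>3</weight>
//       <minShare>1</minShare>
//     </pool>
//     <pool name="hoodiecompact">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>1</weight>
//       <minShare>1</minShare>
//     </pool>
//   </allocations>

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-streaming-ingest")                                        // illustrative app name
  .config("spark.scheduler.mode", "FAIR")                                  // enable fair scheduling
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder path
  .getOrCreate()

// Jobs submitted from the current (ingestion) thread are routed to the named pool.
// Whether the compaction thread spawned by the Hudi streaming sink can be assigned
// its own pool in the same way is the open question in this issue.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "hoodiedeltasync")
```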
Job configs (a sketch of how these options are wired into the streaming write follows after the stacktrace section):

```scala
"hoodie.table.name" -> tableName,
"path" -> "s3a://path/Hudi/".concat(tableName),
"hoodie.datasource.write.table.name" -> tableName,
"hoodie.datasource.write.table.type" -> MERGE_ON_READ,
"hoodie.datasource.write.operation" -> "upsert",
"hoodie.datasource.write.recordkey.field" -> "col5,col6,col7",
"hoodie.datasource.write.partitionpath.field" -> "col1,col2,col3,col4",
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.hive_style_partitioning" -> "true",
// Cleaning options
"hoodie.clean.automatic" -> "true",
"hoodie.clean.max.commits" -> "3",
//"hoodie.clean.async" -> "true",
// Hive sync options
"hoodie.datasource.hive_sync.partition_fields" -> "col1,col2,col3,col4",
"hoodie.datasource.hive_sync.database" -> dbName,
"hoodie.datasource.hive_sync.table" -> tableName,
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.mode" -> "hms",
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.upsert.shuffle.parallelism" -> "200",
"hoodie.insert.shuffle.parallelism" -> "200",
"hoodie.datasource.compaction.async.enable" -> true,
"hoodie.compact.inline.max.delta.commits" -> "10",
"hoodie.index.type" -> "BLOOM"
```

**Stacktrace**

```
Add the stacktrace of the error.
```
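For completeness, here is a sketch of how the options above are wired into the streaming write in my job; the input DataFrame, checkpoint location, and trigger interval are placeholders rather than the real values.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// hudiOptions is the Map[String, String] built from the job configs listed above;
// inputDF is the streaming DataFrame read from the source (placeholder).
def startHudiStream(inputDF: DataFrame, hudiOptions: Map[String, String], tableName: String) =
  inputDF.writeStream
    .format("hudi")                       // Hudi structured streaming sink
    .options(hudiOptions)                 // table/key/compaction options from above
    .outputMode("append")
    .option("checkpointLocation", "s3a://path/checkpoints/".concat(tableName)) // placeholder path
    .trigger(Trigger.ProcessingTime("5 minutes"))                              // placeholder trigger
    .start("s3a://path/Hudi/".concat(tableName))
```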
