haripriyarhp opened a new issue, #8217:
URL: https://github.com/apache/hudi/issues/8217

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have a Spark structured streaming job writing to a MoR table in S3. Based on the docs here 
https://hudi.apache.org/docs/next/compaction#spark-structured-streaming , I 
have set the properties for async compaction. My understanding is that the 
compaction runs asynchronously so that ingestion can proceed in parallel. 
But what I observed is that Spark assigns all the resources to the 
compaction job, and ingestion continues only after compaction is finished, 
even though both jobs are running in parallel. Is there an equivalent of 
`--delta-sync-scheduling-weight`, `--compact-scheduling-weight`, 
`--delta-sync-scheduling-minshare`, and `--compact-scheduling-minshare` for 
Spark structured streaming, so that ingestion and compaction run in parallel with 
proper resource allocation?
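
   For reference, the DeltaStreamer flags mentioned above map onto Spark's fair scheduler pools. For structured streaming it may be worth trying to enable FAIR scheduling directly. This is a sketch, not a confirmed fix: the pool names, weights, and minShare values below are illustrative assumptions, and whether your Hudi version actually assigns the ingestion and compaction jobs to dedicated pools should be checked in the Spark UI's jobs table. The idea is to set `spark.scheduler.mode=FAIR` and point `spark.scheduler.allocation.file` at an allocation file such as:
   
   ```xml
   <?xml version="1.0"?>
   <!-- Illustrative fair-scheduler pools: give ingestion a higher weight and a
        guaranteed minimum share so compaction cannot starve it of executors. -->
   <allocations>
     <pool name="sparkdatasourcewrite">
       <schedulingMode>FAIR</schedulingMode>
       <weight>3</weight>
       <minShare>2</minShare>
     </pool>
     <pool name="hoodiecompact">
       <schedulingMode>FAIR</schedulingMode>
       <weight>1</weight>
       <minShare>1</minShare>
     </pool>
   </allocations>
   ```
   
   With the default FIFO scheduler, all cores go to whichever job submitted its stages first, which matches the starvation behavior described above; FAIR pools only help if the two workloads actually run in different pools.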
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   I expect compaction and ingestion to run in parallel, with ingestion not blocked by compaction.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no (Spark Operator on Kubernetes)
   
   Here is a screenshot of the micro-batches for the Spark structured 
streaming job. Normally each batch takes 4-6 mins, but the batches immediately 
following compaction take noticeably longer. 
   
![async_compaction](https://user-images.githubusercontent.com/109664817/225903090-f2a8aaaf-aa34-49d6-aa59-c7127849832c.PNG)
   
   Batch=206: time taken by a normal batch, for reference.
   
![sensor_morbloom_124_details](https://user-images.githubusercontent.com/109664817/225904984-b13864e8-ddb3-4bd9-8f14-7330d05956f6.PNG)
   
   Batch=207 
   Here you can see the last job, 2961, is HoodieCompactionPlanGenerator, which 
adds 8.6 mins of overhead to this batch.
   
![sensor_morbloom_125_details](https://user-images.githubusercontent.com/109664817/225903294-18ee13ad-bafc-4f57-a820-cf7f9eb6237a.PNG)
   
   Batch=208
   Here you can see it starts with job 2963, but jobs 2964-2972 belong to compaction, and 
it continues with job 2973 only after the compaction jobs are finished, even 
though compaction runs on a different thread (see the last screenshot below). The time taken for 
stages like "Load base files", "Building workload profile", and "Getting 
small files" has drastically increased; compare with batch 206 for the normal timings. 
   
![sensor_morbloom_126_details](https://user-images.githubusercontent.com/109664817/225903283-db13427d-3338-44e0-93a2-e614451b6dd7.PNG)
   
   compaction jobs (2964-2972)
   
![sensor_morbloom_async_compaction](https://user-images.githubusercontent.com/109664817/225909669-6a356796-4a5e-459c-a001-e20f30caeac2.PNG)
   
   Is there anything that can be done to improve this performance? 
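   
   One workaround worth evaluating (an assumption to verify against your Hudi version, not a confirmed fix) is to take compaction out of the streaming application entirely: disable async compaction in the writer (`"hoodie.datasource.compaction.async.enable" -> "false"`) and run it out-of-band from a separate Spark application with its own resources, using the `org.apache.hudi.utilities.HoodieCompactor` utility from the utilities bundle. The paths, memory settings, and parallelism below are illustrative:
   
   ```shell
   # Run compaction in its own Spark application so it does not compete with
   # the streaming writer for executors. Adjust paths/resources to your setup.
   spark-submit \
     --class org.apache.hudi.utilities.HoodieCompactor \
     hudi-utilities-bundle_2.12-0.13.0.jar \
     --base-path s3a://path/Hudi/tableName \
     --table-name tableName \
     --mode scheduleAndExecute \
     --parallelism 200 \
     --spark-memory 4g
   ```
   
   Note that running a separate writer against the same table may require configuring a lock provider for multi-writer concurrency control, depending on how scheduling and execution are split between the two jobs.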
   
   **Additional context**
   
   Using hudi-spark3.1-bundle_2.12-0.13.0.jar along with hadoop-aws_3.1.2.jar 
and aws-java-sdk-bundle_1.11.271.jar. 
   
   job configs
   
   ```scala
   "hoodie.table.name" -> tableName,
   "path" -> "s3a://path/Hudi/".concat(tableName),
   "hoodie.datasource.write.table.name" -> tableName,
   "hoodie.datasource.write.table.type" -> MERGE_ON_READ,
   "hoodie.datasource.write.operation" -> "upsert",
   "hoodie.datasource.write.recordkey.field" -> "col5,col6,col7",
   "hoodie.datasource.write.partitionpath.field" -> "col1,col2,col3,col4",
   "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.hive_style_partitioning" -> "true",
   // Cleaning options
   "hoodie.clean.automatic" -> "true",
   "hoodie.clean.max.commits" -> "3",
   //"hoodie.clean.async" -> "true",
   // Hive sync options
   "hoodie.datasource.hive_sync.partition_fields" -> "col1,col2,col3,col4",
   "hoodie.datasource.hive_sync.database" -> dbName,
   "hoodie.datasource.hive_sync.table" -> tableName,
   "hoodie.datasource.hive_sync.enable" -> "true",
   "hoodie.datasource.hive_sync.mode" -> "hms",
   "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.upsert.shuffle.parallelism" -> "200",
   "hoodie.insert.shuffle.parallelism" -> "200",
   "hoodie.datasource.compaction.async.enable" -> "true",
   "hoodie.compact.inline.max.delta.commits" -> "10",
   "hoodie.index.type" -> "BLOOM"
   ```
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
