abhisheksahani91 opened a new issue, #11739: URL: https://github.com/apache/hudi/issues/11739
**Describe the problem you faced**

We run a separate AWS Glue job to execute compactions that were scheduled by a Glue Deltastreamer job. Even though the compaction job is provisioned with 5 executors, the compaction runs on only one executor.

**To Reproduce**

Steps to reproduce the behavior:

1. Disable compaction execution in the Glue Deltastreamer job so that compactions are only scheduled, not executed.
2. Run a separate Glue job that reads the scheduled compaction instants and executes them.

**Expected behavior**

Since the compaction Glue job has 5 executors, we expect the compaction to run in parallel across multiple executors, but only one executor is running the compaction.

**Environment Description**

AWS Glue 4.0

* Hudi version : 0.12.1
* Spark version : 3.3
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Problem Statement: We have a separate AWS Glue job for executing scheduled compactions.
Even though we have 5 executors for the Glue compaction job, the compaction executes on only one executor:

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.api.java.JavaSparkContext
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
import org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator
import org.apache.hudi.utilities.UtilHelpers
import org.apache.hudi.utilities.HoodieCompactor

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder
      .appName("YourAppName")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.shuffle.partitions", "200") // Increase shuffle partitions for parallelism
      .config("spark.default.parallelism", "200")    // Increase default parallelism
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()

    val glueContext: GlueContext = new GlueContext(spark.sparkContext)

    // Parse command line arguments
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "BASE-PATH", "TABLE-NAME", "SCHEMA-FILE").toArray)

    // Initialize the Glue job
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val config = new HoodieCompactor.Config()
    config.basePath = "s3://datalake/hudi_data_lake_prod10/prod_datalake_latest_v3/"
    config.tableName = "prod_datalake_latest_v3"
    config.schemaFile = "s3://datalake/schema/latest-prod-schema_v1.avsc"

    // Create a JavaSparkContext using the existing SparkSession's SparkContext
    val javaSparkContext = new JavaSparkContext(spark.sparkContext)

    // Create a HoodieCompactor instance and invoke compact directly
    val hoodieCompactor = new HoodieCompactor(javaSparkContext, config)
    hoodieCompactor.compact(1)

    // Commit the Glue job
    Job.commit()
  }
}
```

<img width="1277" alt="Screenshot 2024-08-08 at 2 33 52 AM" src="https://github.com/user-attachments/assets/6fda0574-1383-41de-977c-932e8aafd62a">
<img width="1274" alt="Screenshot 2024-08-08 at 2 35 37 AM" src="https://github.com/user-attachments/assets/9e03e5c7-dbf1-46dd-b698-54a1801696b1">
<img width="1272" alt="Screenshot 2024-08-08 at 2 48 44 AM" src="https://github.com/user-attachments/assets/3c189cc2-b420-4bf1-ac06-db078d428100">
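For comparison, the Hudi docs describe running the same offline compaction via `spark-submit`, where `HoodieCompactor` accepts a `--parallelism` option that controls how many tasks the compaction is split into (the programmatic equivalent would be setting the corresponding field on `HoodieCompactor.Config`, if your Hudi version exposes it). This is a hedged sketch, not a verified fix: the bundle jar path, instant time, and memory setting below are placeholders, and whether the Glue job above defaults to a parallelism of 1 is an assumption worth checking.

```shell
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  /path/to/hudi-utilities-bundle.jar \
  --base-path s3://datalake/hudi_data_lake_prod10/prod_datalake_latest_v3/ \
  --table-name prod_datalake_latest_v3 \
  --schema-file s3://datalake/schema/latest-prod-schema_v1.avsc \
  --instant-time <compaction_instant> \
  --parallelism 200 \
  --spark-memory 4g
```

If the in-process `HoodieCompactor.Config` leaves parallelism at its default, that would be consistent with the single-executor behavior seen in the screenshots.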
