abhisheksahani91 opened a new issue, #11739: URL: https://github.com/apache/hudi/issues/11739
**Describe the problem you faced**

We run a separate AWS Glue job to execute compactions that were scheduled by a Glue Deltastreamer job. Even though the compaction job is provisioned with 5 executors, the compaction runs on only one executor.

**To Reproduce**

Steps to reproduce the behavior:

1. Disable compaction execution in the Glue Deltastreamer job so that compactions are only scheduled, not executed.
2. Run a separate Glue job that reads the scheduled compaction instants and executes them.

**Expected behavior**

Since the compaction Glue job has 5 executors, we expect the compaction to run in parallel across multiple executors, but only one executor is running the compaction.

**Environment Description**

AWS Glue 4.0

* Hudi version : 0.12.1
* Spark version : 3.3
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Problem Statement: We have a separate AWS Glue job for executing scheduled compactions.
Even though we have 5 executors for the Glue compaction job, the compaction executes on only one executor:

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.api.java.JavaSparkContext
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
import org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator
import org.apache.hudi.utilities.UtilHelpers
import org.apache.hudi.utilities.HoodieCompactor

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder
      .appName("YourAppName")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.shuffle.partitions", "200") // Increase shuffle partitions for parallelism
      .config("spark.default.parallelism", "200")    // Increase default parallelism
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()

    val glueContext: GlueContext = new GlueContext(spark.sparkContext)

    // Parse command line arguments
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "BASE-PATH", "TABLE-NAME", "SCHEMA-FILE").toArray)

    // Initialize the Glue job
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val config = new HoodieCompactor.Config()
    config.basePath = "s3://datalake/hudi_data_lake_prod10/prod_datalake_latest_v3/"
    config.tableName = "prod_datalake_latest_v3"
    config.schemaFile = "s3://datalake/schema/latest-prod-schema_v1.avsc"

    // Create a JavaSparkContext using the existing SparkSession's SparkContext
    val javaSparkContext = new JavaSparkContext(spark.sparkContext)

    // Create a HoodieCompactor instance and invoke compact directly
    val hoodieCompactor = new HoodieCompactor(javaSparkContext, config)
    hoodieCompactor.compact(1)

    // Commit the Glue job
    Job.commit()
  }
}
```

<img width="1277" alt="Screenshot 2024-08-08 at 2 33 52 AM" src="https://github.com/user-attachments/assets/6fda0574-1383-41de-977c-932e8aafd62a">
<img width="1274" alt="Screenshot 2024-08-08 at 2 35 37 AM" src="https://github.com/user-attachments/assets/9e03e5c7-dbf1-46dd-b698-54a1801696b1">
<img width="1272" alt="Screenshot 2024-08-08 at 2 48 44 AM" src="https://github.com/user-attachments/assets/3c189cc2-b420-4bf1-ac06-db078d428100">
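For comparison, the Hudi docs describe running the same offline compaction via `spark-submit`, where `HoodieCompactor` accepts a `--parallelism` option that controls how many tasks the compaction is split into (the programmatic equivalent would be setting the corresponding field on `HoodieCompactor.Config`, if your Hudi version exposes it). This is a hedged sketch, not a verified fix: the bundle jar path, instant time, and memory setting below are placeholders, and whether the Glue job above defaults to a parallelism of 1 is an assumption worth checking.

```shell
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  /path/to/hudi-utilities-bundle.jar \
  --base-path s3://datalake/hudi_data_lake_prod10/prod_datalake_latest_v3/ \
  --table-name prod_datalake_latest_v3 \
  --schema-file s3://datalake/schema/latest-prod-schema_v1.avsc \
  --instant-time <compaction_instant> \
  --parallelism 200 \
  --spark-memory 4g
```

If the in-process `HoodieCompactor.Config` leaves parallelism at its default, that would be consistent with the single-executor behavior seen in the screenshots.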
