soumilshah1995 opened a new issue, #10752: URL: https://github.com/apache/hudi/issues/10752
Greetings! I am currently developing a community video that illustrates the advantages of running DeltaStreamer on AWS Glue as opposed to EMR. AWS Glue, being serverless and lightweight, offers significant benefits for data processing tasks.

# Step 1: Download the dataset and upload it to S3

Link: https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link

# Step 2: Create a Scala job

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.api.java.JavaSparkContext
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
import org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator
import org.apache.hudi.utilities.UtilHelpers

object GlueApp {
  def main(sysArgs: Array[String]) {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)

    val config = Array(
      "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
      "--source-ordering-field", "replicadmstimestamp",
      "--target-base-path", "s3://XX/test_silver/",
      "--target-table", "invoice",
      "--table-type", "COPY_ON_WRITE",
      "--hoodie-conf", "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator",
      "--hoodie-conf", "hoodie.datasource.write.recordkey.field=invoiceid",
      "--hoodie-conf", "hoodie.datasource.write.partitionpath.field=destinationstate",
      "--hoodie-conf", "hoodie.streamer.source.dfs.root=s3://XX/test/",
      "--hoodie-conf", "hoodie.datasource.write.precombine.field=replicadmstimestamp"
    )

    val cfg = HoodieDeltaStreamer.getConfig(config)
    val additionalSparkConfigs =
      SchedulerConfGenerator.getSparkSchedulingConfigs(cfg)
    val jssc = UtilHelpers.buildSparkContext("delta-streamer-test", "jes", additionalSparkConfigs)
    val spark = jssc.sc
    val glueContext: GlueContext = new GlueContext(spark)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    try {
      new HoodieDeltaStreamer(cfg, jssc).sync()
      // Commit the Glue job while the context is still alive,
      // then stop the context in the finally block.
      Job.commit()
    } finally {
      jssc.stop()
    }
  }
}
```

I've attempted to run the code above, but I'm having difficulty getting it to work. Here is the approach I've tried:

#### Approach 1: Glue 4.0 with job parameters

```
--datalake-formats | hudi
--conf | spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.legacy.pathOptionBehavior.enabled=true --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
```

Unfortunately, this approach did not yield the desired outcome. I also attempted to upload custom JAR files to S3 and reference them from the job, but encountered similar issues.

I would greatly appreciate any assistance or insights regarding the correct way to run DeltaStreamer on AWS Glue. Thank you for your attention and support.
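As a side note, the long hand-written `Array` of flags is easy to get wrong (a missing value shifts every argument after it). A small helper that flattens key/value pairs into the `--key value` / `--hoodie-conf k=v` shape DeltaStreamer expects can make the job script easier to maintain. This is only an illustrative sketch of my own, not part of the Hudi API; the `StreamerArgs` name and structure are assumptions:

```
// Illustrative helper (not a Hudi API): build the DeltaStreamer argument
// array from ordered key/value pairs instead of a hand-written flat Array.
object StreamerArgs {
  // base: ("target-table" -> "invoice") becomes "--target-table", "invoice"
  // hoodieConf: ("k" -> "v") becomes "--hoodie-conf", "k=v"
  def build(base: Seq[(String, String)], hoodieConf: Seq[(String, String)]): Array[String] = {
    val baseArgs = base.flatMap { case (k, v) => Seq(s"--$k", v) }
    val confArgs = hoodieConf.flatMap { case (k, v) => Seq("--hoodie-conf", s"$k=$v") }
    (baseArgs ++ confArgs).toArray
  }
}
```

For example, `StreamerArgs.build(Seq("target-table" -> "invoice"), Seq("hoodie.datasource.write.recordkey.field" -> "invoiceid"))` produces the same shape of array that the job script above passes to `HoodieDeltaStreamer.getConfig`.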
