harishraju-govindaraju opened a new issue #4641: URL: https://github.com/apache/hudi/issues/4641
**Describe the problem you faced**

I started an EMR cluster and am trying to run the Hudi DeltaStreamer. However, I get an error:

```
Failed to load org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.
java.lang.ClassNotFoundException: org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
```

I was following the steps in this documentation: https://hudi.apache.org/blog/2021/08/23/s3-events-source

**To Reproduce**

Steps to reproduce the behavior:

```sh
# To start S3EventsSource
spark-submit \
  --jars "/home/hadoop/hudi-utilities-bundle_2.11-0.9.0.jar,/usr/lib/spark/external/lib/spark-avro.jar,/home/hadoop/aws-java-sdk-sqs-1.12.22.jar" \
  --master yarn --deploy-mode client \
  --class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer" /home/hadoop/hudi-packages/hudi-utilities-bundle_2.11-0.9.0-SNAPSHOT.jar \
  --table-type COPY_ON_WRITE --source-ordering-field eventTime \
  --target-base-path s3://s3-eip-dev-uea1-hudipoc-001/hudi-trusted/metadata/ \
  --target-table s3_meta_table --continuous \
  --min-sync-interval-seconds 10 \
  --hoodie-conf hoodie.datasource.write.recordkey.field="s3.object.key,eventName" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=s3.bucket.name --enable-hive-sync \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=default \
  --hoodie-conf hoodie.datasource.hive_sync.table=s3_meta_table \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields=bucket \
  --source-class org.apache.hudi.utilities.sources.S3EventsSource \
  --hoodie-conf hoodie.deltastreamer.source.queue.url=https://sqs.us-east-1.amazonaws.com/118897059965/sqshudi \
  --hoodie-conf hoodie.deltastreamer.s3.source.queue.region=us-east-1
```

**Expected behavior**

I expect this spark-submit command to run successfully.

**Environment Description**

* Hudi version :
* Hive version : 2.3.7
* EMR version : 5.33.1 (Hive 2.3.7, Spark 2.4.7, Flink 1.12.1)
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
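A `ClassNotFoundException` for `HoodieDeltaStreamer` usually means the bundle jar passed to `--class` does not actually contain that class (wrong jar, wrong path, or a bundle built for a different profile). A minimal sketch of a local check, assuming the jar path from the command above (`check_class` is a hypothetical helper name, not a Hudi tool):

```shell
# Hypothetical diagnostic: list a jar's entries and look for the
# DeltaStreamer class file. Prints "found" or "missing".
check_class() {
  local jar="$1"
  if unzip -l "$jar" 2>/dev/null \
      | grep -q 'org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.class'; then
    echo "found"
  else
    echo "missing"
  fi
}

# Path assumed from the spark-submit command in this report.
check_class /home/hadoop/hudi-packages/hudi-utilities-bundle_2.11-0.9.0-SNAPSHOT.jar
```

If this prints "missing", the jar on that path is not a complete utilities bundle and spark-submit will fail exactly as reported, regardless of the rest of the configuration.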
