jcunhafonte opened a new issue #1813:
URL: https://github.com/apache/hudi/issues/1813


   I'm following the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and running into an issue when executing the suggested command for the second and subsequent times.
   
   1. Have DMS generate the raw .parquet files in S3.
   2. Use HoodieDeltaStreamer to process the raw .parquet files:
   
   ```
   spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --packages org.apache.spark:spark-avro_2.12:2.4.4 \
     --master yarn --deploy-mode client \
     hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
     --table-type COPY_ON_WRITE \
     --source-ordering-field dms_timestamp \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
     --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --props file:///usr/lib/hudi/dfs-source.properties \
     --hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3:/my-test-bucket/hudi_dms/orders
   ```
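   As an aside (and only an assumption about intent, since I cannot verify how the comma-joined value is parsed here): each property could also be passed as its own `--hoodie-conf` flag, and the source root in the command above uses `s3:/` with a single slash, which may be a typo for `s3://`. A sketch of that portion of the command, using the same bucket and property names as above:
   ```shell
   # Sketch only: same properties as above, one --hoodie-conf per key=value.
   # Note the s3:// scheme (two slashes) on the source root, which the original
   # command may have mistyped as s3:/ .
     --hoodie-conf hoodie.datasource.write.recordkey.field=id \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=id \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders
   ```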
   The first run works perfectly; however, when I try to keep "refreshing" the data from a scheduled job, I get the following error:
   ```
   ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
   org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
        at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
        at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
        at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```
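   In case it is useful context for triage: the command above relies on the source schema being inferred, and the error suggests no schema provider was available on the failing run. A possible workaround (an assumption on my part, not a confirmed fix for this issue) would be to supply an explicit schema provider, e.g. `FilebasedSchemaProvider`, pointing at an Avro schema file. The schema file path below is a hypothetical placeholder:
   ```shell
   # Sketch: extra flags to pin an explicit schema provider instead of relying
   # on schema inference from the source. The .avsc path is hypothetical.
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-test-bucket/schemas/orders.avsc \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://my-test-bucket/schemas/orders.avsc
   ```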
   
   Hudi version : 0.5.2 (incubating)
   Spark version : 2.4.4
   Hive version : 3.1.2
   Storage (HDFS/S3/GCS..) : S3
   Running on Docker? (yes/no) : No
   
   Thank you.

