jcunhafonte opened a new issue #1813: URL: https://github.com/apache/hudi/issues/1813
I'm executing the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and running into an issue on the second and subsequent runs of the suggested command. The steps are:

1. Have DMS generate the raw .parquet files in S3.
2. Use HoodieDeltaStreamer to process the raw .parquet files:

```
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.spark:spark-avro_2.12:2.4.4 \
  --master yarn --deploy-mode client \
  hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --props file:///usr/lib/hudi/dfs-source.properties \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3:/my-test-bucket/hudi_dms/orders
```

When I run it for the first time it works perfectly; however, when I try to keep "refreshing" the data from a scheduled job, I get the following error:

```
ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
	at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

**Environment description**

* Hudi version : 0.5.2 (incubating)
* Spark version : 2.4.4
* Hive version : 3.1.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

Thank you.
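For reference, one way this error is commonly worked around is to give DeltaStreamer an explicit schema provider, so that an incremental round that finds no new parquet files still has a schema to fall back on. The sketch below is a hypothetical illustration, not from the report above: the schema file path `s3://my-test-bucket/schemas/orders.avsc` and its location are assumptions, and the existing command's other flags are unchanged.

```
# Hypothetical dfs-source.properties additions (paths are assumptions):
hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders
# Point the file-based schema provider at an Avro schema for the source table:
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-test-bucket/schemas/orders.avsc
```

With a properties file like this, the spark-submit invocation would additionally pass `--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider` alongside the existing flags.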
