Hi Team,
I am reading Parquet data with Hudi DeltaStreamer and writing it into a Hudi
dataset. The pipeline is:
S3 > EMR (Hudi DeltaStreamer) > S3 (Hudi dataset)
I specified an Avro schema as the target schema through the parameter
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
The DeltaStreamer command is:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --master yarn --deploy-mode client \
  ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field action_date \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://emr-spark-scripts/hudi_spark_test \
  --target-table hudi_spark_test \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/ \
  --continuous
However, I am getting the following schema error:
org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'cop_amt' not found
    at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
    at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
I have included the failing field (cop_amt) in the schema file, but I am still
getting this error.
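
For reference, here is a rough sketch of how the presence of a field in an
.avsc file can be checked. It is only an illustration: the helper names
(load_avsc, avro_field_names) and the inline example schema are hypothetical
stand-ins, not my real schema.avsc, and it only looks at top-level fields of a
record schema using Python's standard json module.

```python
import json


def load_avsc(path):
    """Parse an Avro schema file (.avsc is plain JSON) from a local path."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def avro_field_names(schema):
    """Return the top-level field names of a parsed Avro record schema."""
    if not isinstance(schema, dict) or schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    return [field["name"] for field in schema.get("fields", [])]


# Hypothetical stand-in for schema.avsc; the field types are assumptions.
example = json.loads("""
{
  "type": "record",
  "name": "example_record",
  "fields": [
    {"name": "wbn", "type": "string"},
    {"name": "ad", "type": "string"},
    {"name": "action_date", "type": "string"},
    {"name": "cop_amt", "type": ["null", "double"], "default": null}
  ]
}
""")

# Check whether the field named in the exception is declared in the schema.
print("cop_amt" in avro_field_names(example))
```

With a local copy of the real file, the same check would be
avro_field_names(load_avsc("schema.avsc")).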
Could you please help me understand how to reference the schema file correctly?
Thanks
Raghvendra