Thanks Pratyaksh,

But I am already assigning the target schema here as
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc

It doesn’t help. The troubleshooting guide asks me to build an uber schema and refer to it as the target schema, but I am not sure what an uber schema is. Could you please help me with this?

Thanks
Raghvendra

On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <[email protected]> wrote:

> This might help - Caused by: org.apache.parquet.io.InvalidRecordException:
> Parquet/Avro schema mismatch: Avro field 'col1' not found
> <https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound>.
>
> Please let us know in case of any more queries.
>
> On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey
> <[email protected]> wrote:
>
> > Hi Team,
> >
> > I am reading parquet data with HudiDeltaStreamer and writing it into a
> > Hudi Dataset.
> > s3 > EMR(Hudi DeltaStreamer) > S3(Hudi Dataset)
> >
> > I referenced the Avro schema as the target schema through the parameter
> >
> > hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> >
> > The DeltaStreamer command looks like this:
> > spark-submit --class
> > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
> > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
> > ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
> > --target-base-path s3://emr-spark-scripts/hudi_spark_test --target-table
> > hudi_spark_test --transformer-class
> > org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class
> > org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
> > hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/
> > --continuous
> >
> > but I am getting a schema error, i.e.
> > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > mismatch: Avro field 'cop_amt' not found
> > at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
> > at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
> > at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> > at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> > at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> > at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> > at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> > at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> >
> > I have included the failing field in the schema but I am still getting
> > this error. Could you please help with how I should reference the
> > schema file?
> >
> > Thanks
> > Raghvendra
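
For readers unsure what an "uber schema" would look like, here is a rough sketch. It assumes the term means a single Avro record schema containing the union of the fields from every schema version written so far, with any field missing from some version made nullable with a null default. The helper names (`make_nullable`, `build_uber_schema`), the record name, and the field types are hypothetical; only the field names `wbn`, `ad`, and `cop_amt` come from the thread. The script uses plain Python dictionaries and no Hudi or Avro library calls.

```python
import json


def make_nullable(field):
    """Return a copy of an Avro field that accepts null and defaults to null."""
    avro_type = field["type"]
    branches = avro_type if isinstance(avro_type, list) else [avro_type]
    # Avro requires the default to match the first branch of a union,
    # so "null" goes first.
    branches = ["null"] + [b for b in branches if b != "null"]
    return {**field, "type": branches, "default": None}


def build_uber_schema(versions):
    """Merge several versions of a record schema into one (hypothetical helper).

    Assumption: fields with the same name have compatible types across
    versions; the first definition seen is kept.
    """
    fields = {}
    seen_count = {}
    for schema in versions:
        for field in schema["fields"]:
            fields.setdefault(field["name"], field)
            seen_count[field["name"]] = seen_count.get(field["name"], 0) + 1

    merged = []
    for name, field in fields.items():
        # A field absent from at least one version must be optional.
        if seen_count[name] < len(versions):
            field = make_nullable(field)
        merged.append(field)

    return {"type": "record", "name": versions[-1]["name"], "fields": merged}


# Illustrative schema versions; the types are assumed, not taken from the thread.
v1 = {
    "type": "record",
    "name": "hudi_spark_test",
    "fields": [
        {"name": "wbn", "type": "string"},
        {"name": "ad", "type": "string"},
        {"name": "cop_amt", "type": "double"},
    ],
}
v2 = {
    "type": "record",
    "name": "hudi_spark_test",
    "fields": [
        {"name": "wbn", "type": "string"},
        {"name": "ad", "type": "string"},
    ],
}

uber = build_uber_schema([v1, v2])
print(json.dumps(uber, indent=2))
```

The printed JSON could then be saved as the `.avsc` file that `hoodie.deltastreamer.schemaprovider.target.schema.file` points to. Whether this resolves the `cop_amt` error depends on the schemas of the parquet files already written, which the thread does not show.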
