Hi Raghvendra,

As mentioned in the FAQ, this error occurs when your schema has evolved by deleting some field, in your case 'cop_amt'. Even though your current target schema has this field, the problem occurs because some incoming records do not have it. To fix this, you have the following options:

1. Make sure none of the fields ever get deleted.
2. Else keep a default value for this field and send all your records with that default value.
3. Try creating an uber schema. By uber schema I mean a schema which has all the fields that were ever a part of your incoming records (see the sketches below). If you are using HiveSyncTool along with DeltaStreamer, then the Hive metastore can be a good source of truth for getting all the fields ever ingested.
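To make option 3 concrete, here is a minimal sketch of what such an uber schema file (your schema.avsc) could look like. The record name, namespace, and field types are only guesses based on the fields visible in your command (wbn, ad, action_date); the point is that 'cop_amt' stays in the schema as a nullable field with a default, so records with and without it both conform:

    {
      "type": "record",
      "name": "hudi_spark_test_record",
      "namespace": "com.example",
      "doc": "Uber schema sketch: the union of every field ever ingested",
      "fields": [
        {"name": "wbn", "type": "string", "doc": "record key field; type is a guess"},
        {"name": "ad", "type": "string", "doc": "partition path field; type is a guess"},
        {"name": "action_date", "type": "string", "doc": "source ordering field; actual type may differ"},
        {"name": "cop_amt", "type": ["null", "double"], "default": null, "doc": "deleted upstream; kept nullable with a default so old and new records both parse"}
      ]
    }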
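And if the table is being synced with HiveSyncTool, a quick way to enumerate all fields for the uber schema is to ask the metastore directly (a sketch; I am assuming the table was synced under the name hudi_spark_test, so substitute your actual database/table name):

    -- in the hive CLI or beeline: lists every column the table has accumulated across syncs
    DESCRIBE hudi_spark_test;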
Please let me know if this makes sense.

On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey <[email protected]> wrote:

> Thanks Pratyaksh,
>
> But I am assigning the target schema here as
>
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
>
> and it doesn't help. The troubleshooting guide asks to build an uber
> schema and refer to it as the target schema, but I am not sure about the
> uber schema. Could you please help me with this?
>
> Thanks
> Raghvendra
>
> On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <[email protected]> wrote:
>
> > This might help - Caused by: org.apache.parquet.io.InvalidRecordException:
> > Parquet/Avro schema mismatch: Avro field 'col1' not found
> > <https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound>
> >
> > Please let us know in case of any more queries.
> >
> > On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey <[email protected]> wrote:
> >
> > > Hi Team,
> > >
> > > I am reading parquet data with Hudi DeltaStreamer and writing it into
> > > a Hudi dataset:
> > > S3 -> EMR (Hudi DeltaStreamer) -> S3 (Hudi Dataset)
> > >
> > > I referred to an Avro schema as the target schema through the parameter
> > >
> > > hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > >
> > > The DeltaStreamer command looks like:
> > >
> > > spark-submit --class
> > > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
> > > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
> > > ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
> > > --target-base-path s3://emr-spark-scripts/hudi_spark_test --target-table
> > > hudi_spark_test --transformer-class
> > > org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class
> > > org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
> > > hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/
> > > --continuous
> > >
> > > but I am getting a schema issue:
> > >
> > > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > > mismatch: Avro field 'cop_amt' not found
> > >   at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
> > >   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
> > >   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> > >   at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> > >   at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> > >   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> > >   at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> > >   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> > >
> > > I have added the errored field to the schema but I am still getting this
> > > issue. Could you please help me with how to refer the schema file?
> > >
> > > Thanks
> > > Raghvendra
