Hi Raghvendra,

As mentioned in the FAQ, this error occurs when your schema has evolved by deleting some field, in your case 'cop_amt'. Even though your current target schema has this field, the problem occurs because some incoming records do not have it. To fix this, you have the following options:

1. Make sure none of the fields ever get deleted.
2. Else keep a default value for this field and send all your records with that default value.
3. Try creating an uber schema. By uber schema I mean a schema which has all the fields that were ever a part of your incoming records (see the sketches below). If you are using HiveSyncTool along with DeltaStreamer, then the Hive metastore can be a good source of truth for getting all the fields ever ingested.
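To make option 3 concrete, here is a minimal sketch of what such an uber schema file (your schema.avsc) could look like. The record name, namespace, and field types are only guesses based on the fields visible in your command (wbn, ad, action_date); the point is that 'cop_amt' stays in the schema as a nullable field with a default, so records with and without it both conform:

    {
      "type": "record",
      "name": "hudi_spark_test_record",
      "namespace": "com.example",
      "doc": "Uber schema sketch: the union of every field ever ingested",
      "fields": [
        {"name": "wbn", "type": "string", "doc": "record key field; type is a guess"},
        {"name": "ad", "type": "string", "doc": "partition path field; type is a guess"},
        {"name": "action_date", "type": "string", "doc": "source ordering field; actual type may differ"},
        {"name": "cop_amt", "type": ["null", "double"], "default": null, "doc": "deleted upstream; kept nullable with a default so old and new records both parse"}
      ]
    }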
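And if the table is being synced with HiveSyncTool, a quick way to enumerate all fields for the uber schema is to ask the metastore directly (a sketch; I am assuming the table was synced under the name hudi_spark_test, so substitute your actual database/table name):

    -- in the hive CLI or beeline: lists every column the table has accumulated across syncs
    DESCRIBE hudi_spark_test;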
Please let me know if this makes sense.

On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey <[email protected]> wrote:

> Thanks Pratyaksh,
>
> But I am assigning the target schema here as
>
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
>
> and it doesn't help. The troubleshooting guide asks to build an uber
> schema and refer to it as the target schema, but I am not sure about the
> uber schema. Could you please help me with this?
>
> Thanks
> Raghvendra
>
> On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <[email protected]> wrote:
>
> > This might help - Caused by: org.apache.parquet.io.InvalidRecordException:
> > Parquet/Avro schema mismatch: Avro field 'col1' not found
> > <https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound>
> >
> > Please let us know in case of any more queries.
> >
> > On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey <[email protected]> wrote:
> >
> > > Hi Team,
> > >
> > > I am reading parquet data with Hudi DeltaStreamer and writing it into
> > > a Hudi dataset:
> > > S3 -> EMR (Hudi DeltaStreamer) -> S3 (Hudi Dataset)
> > >
> > > I referred to an Avro schema as the target schema through the parameter
> > >
> > > hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > >
> > > The DeltaStreamer command looks like:
> > >
> > > spark-submit --class
> > > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
> > > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
> > > ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
> > > --target-base-path s3://emr-spark-scripts/hudi_spark_test --target-table
> > > hudi_spark_test --transformer-class
> > > org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class
> > > org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
> > > hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/
> > > --continuous
> > >
> > > but I am getting a schema issue:
> > >
> > > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > > mismatch: Avro field 'cop_amt' not found
> > >   at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
> > >   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
> > >   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> > >   at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> > >   at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> > >   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> > >   at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> > >   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> > >
> > > I have added the errored field to the schema but I am still getting this
> > > issue. Could you please help me with how to refer the schema file?
> > >
> > > Thanks
> > > Raghvendra
