Hi Raghvendra,

As per the code flow of the Parquet reader, I do not see any reason why this exception should be thrown if your target schema actually contains the concerned field. I would suggest printing the target schema just before the ParquetReader flow starts in the HoodieCopyOnWriteTable class, i.e. print writerSchema in HoodieMergeHandle and cross-check that the concerned field is actually getting passed to the ParquetReader.
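If it helps, a one-liner like the following inside HoodieMergeHandle should confirm it (a hypothetical debug line; writerSchema is the field named in this thread, and the exact internals may differ across Hudi versions):

    // Log the Avro schema that will be handed to the Parquet reader.
    LOG.info("writerSchema passed to ParquetReader: " + writerSchema.toString(true));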
On Mon, Mar 16, 2020 at 2:25 PM Raghvendra Dhar Dubey <[email protected]> wrote:

It is nullable, like {"name":"_id","type":["null","string"],"default":null}

On Mon, Mar 16, 2020 at 2:22 PM Pratyaksh Sharma <[email protected]> wrote:

How have you mentioned the field in your schema file? Is it a nullable field, or does it have a default value?

On Mon, Mar 16, 2020 at 1:36 PM Raghvendra Dhar Dubey <[email protected]> wrote:

Thanks Pratyaksh,

I got your point, but as in the example, I used an S3 avro schema file to refer to the full merged schema, and it is not working. I didn't try the HiveSyncTool for this. Is there any option to refer to Glue?

On Mon, Mar 16, 2020 at 12:56 PM Pratyaksh Sharma <[email protected]> wrote:

Hi Raghvendra,

As mentioned in the FAQ, this error occurs when your schema has evolved in terms of deleting some field, in your case 'cop_amt'. Even if your current target schema has this field, the problem is occurring because some incoming record does not have this field. To fix this, you have the following options -

1. Make sure none of the fields get deleted.
2. Else have some default value for this field and send all your records with that default value.
3. Try creating an uber schema (a rough sketch of what I mean follows below).

By uber schema I mean a schema which has all the fields that were ever a part of your incoming records. If you are using HiveSyncTool along with DeltaStreamer, then the Hive metastore can be a good source of truth for getting all the fields ever ingested. Please let me know if this makes sense.
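Something along these lines could build such a schema programmatically (a minimal sketch using the Avro Schema API; the class name UberSchemaBuilder and the record name/namespace are made up for illustration):

    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Field;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class UberSchemaBuilder {
      // Union of the fields of every schema version ever ingested;
      // the first definition of a field name wins.
      public static Schema buildUberSchema(List<Schema> versions) {
        Map<String, Field> byName = new LinkedHashMap<>();
        for (Schema version : versions) {
          for (Field f : version.getFields()) {
            // Field objects carry a position, so copy each one instead of reusing it.
            byName.putIfAbsent(f.name(),
                new Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
          }
        }
        Schema uber = Schema.createRecord("uber_record", null, "hudi.example", false);
        uber.setFields(new ArrayList<>(byName.values()));
        return uber;
      }
    }

Fields that only exist in older versions should be nullable with a default (like your _id example above) so that records missing them can still be written.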
On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey <[email protected]> wrote:

Thanks Pratyaksh,
But I am assigning the target schema here as

hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc

But it doesn't help. As per the troubleshooting guide, it asks to build an uber schema and refer to it as the target schema, but I am not sure about the uber schema. Could you please help me with this?

Thanks
Raghvendra

On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <[email protected]> wrote:

This might help - Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'col1' not found
<https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound>

Please let us know in case of any more queries.

On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey <[email protected]> wrote:

Hi Team,

I am reading parquet data through HoodieDeltaStreamer and writing it into a Hudi dataset:

s3 > EMR (Hudi DeltaStreamer) > S3 (Hudi Dataset)

I referred the avro schema as the target schema through the parameter

hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc

The DeltaStreamer command looks like

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar --table-type COPY_ON_WRITE --source-ordering-field action_date --source-class org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path s3://emr-spark-scripts/hudi_spark_test --target-table hudi_spark_test --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/ --continuous

but I am getting a schema issue, i.e.

org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'cop_amt' not found
    at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
    at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)

I have referred the errored field in the schema but am still getting this issue. Could you please help with how I can refer the schema file?

Thanks
Raghvendra
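For anyone debugging the same mismatch: this exception means an existing Parquet file under the target base path has a column that the Avro read schema does not. A quick way to dump the schema the files actually carry (a minimal sketch using the parquet-hadoop footer API; the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.schema.MessageType;

    public class DumpParquetSchema {
      public static void main(String[] args) throws Exception {
        // Point this at one of the data files under the Hudi target base path.
        Path file = new Path("s3://emr-spark-scripts/hudi_spark_test/<partition>/<file>.parquet");
        MessageType parquetSchema = ParquetFileReader
            .readFooter(new Configuration(), file, ParquetMetadataConverter.NO_FILTER)
            .getFileMetaData().getSchema();
        // Every column printed here (e.g. cop_amt) must also exist in schema.avsc.
        System.out.println(parquetSchema);
      }
    }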
