It is nullable, e.g. {"name":"_id","type":["null","string"],"default":null}
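
For context, a minimal sketch of how such a nullable field can sit inside a full target/uber schema file. The record name and namespace, and the types of the other two fields, are illustrative assumptions; "wbn" and "cop_amt" are just the field names mentioned further down in this thread:

  {
    "type": "record",
    "name": "ExampleRecord",
    "namespace": "com.example",
    "fields": [
      {"name": "_id", "type": ["null", "string"], "default": null},
      {"name": "wbn", "type": "string"},
      {"name": "cop_amt", "type": ["null", "double"], "default": null}
    ]
  }

Keeping the previously deleted field ("cop_amt" here) in the schema as a nullable union with a null default is in line with the uber-schema approach discussed below, so that both old records (with the field) and new records (without it) resolve against the same target schema.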

On Mon, Mar 16, 2020 at 2:22 PM Pratyaksh Sharma <[email protected]>
wrote:

> How have you mentioned the field in your schema file? Is it a nullable
> field or does it have a default value?
>
> On Mon, Mar 16, 2020 at 1:36 PM Raghvendra Dhar Dubey
> <[email protected]> wrote:
>
> > Thanks Pratyaksh,
> >
> > I got your point, but as in the example I used an S3 Avro schema file to
> > refer to the complete merged schema, and it is not working.
> > I didn't try HiveSyncTool for this. Is there any option to refer to Glue?
> >
> >
> > On Mon, Mar 16, 2020 at 12:56 PM Pratyaksh Sharma <[email protected]> wrote:
> >
> > > Hi Raghvendra,
> > >
> > > As mentioned in the FAQ, this error occurs when your schema has evolved
> > > by deleting some field, in your case 'cop_amt'. Even if your current
> > > target schema has this field, the problem occurs because some incoming
> > > records do not have it. To fix this, you have the following options -
> > >
> > > 1. Make sure none of the fields get deleted.
> > > 2. Else have some default value for this field and send all your
> > > records with that default value.
> > > 3. Try creating an uber schema.
> > >
> > > By uber schema I mean a schema which has all the fields that were ever
> > > part of your incoming records. If you are using HiveSyncTool along with
> > > DeltaStreamer, then the Hive metastore can be a good source of truth for
> > > getting all the fields ever ingested. Please let me know if this makes
> > > sense.
> > >
> > > On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey
> > > <[email protected]> wrote:
> > >
> > > > Thanks Pratyaksh,
> > > > But I am assigning the target schema here as
> > > > hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > > >
> > > > But it doesn’t help. As per the troubleshooting guide, it asks to build
> > > > an uber schema and refer to it as the target schema, but I am not sure
> > > > about the uber schema. Could you please help me with this?
> > > >
> > > > Thanks
> > > > Raghvendra
> > > >
> > > > On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <[email protected]> wrote:
> > > >
> > > > > This might help - Caused by:
> > > > > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > > > > mismatch: Avro field 'col1' not found
> > > > > <https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound>.
> > > > >
> > > > > Please let us know in case of any more queries.
> > > > >
> > > > > On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Team,
> > > > > >
> > > > > > I am reading parquet data with Hudi DeltaStreamer and writing the
> > > > > > data into a Hudi dataset:
> > > > > > S3 > EMR (Hudi DeltaStreamer) > S3 (Hudi dataset)
> > > > > >
> > > > > > I referred to the Avro schema as the target schema through the parameter
> > > > > > hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > > > > >
> > > > > > The DeltaStreamer command is like:
> > > > > > spark-submit --class
> > > > > > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
> > > > > > --packages org.apache.spark:spark-avro_2.11:2.4.4
> > > > > > --master yarn --deploy-mode client
> > > > > > ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > > > > > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > > > > > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
> > > > > > --target-base-path s3://emr-spark-scripts/hudi_spark_test
> > > > > > --target-table hudi_spark_test
> > > > > > --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer
> > > > > > --payload-class org.apache.hudi.payload.AWSDmsAvroPayload
> > > > > > --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/
> > > > > > --continuous
> > > > > >
> > > > > > but I am getting a schema issue, i.e.
> > > > > > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > > > > > mismatch: Avro field 'cop_amt' not found
> > > > > >         at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
> > > > > >         at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
> > > > > >         at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> > > > > >         at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> > > > > >         at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> > > > > >         at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> > > > > >         at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> > > > > >         at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> > > > > >
> > > > > > I have included the errored field in the schema but am still
> > > > > > getting this issue. Could you guys please help with how I can
> > > > > > refer to the schema file?
> > > > > >
> > > > > > Thanks
> > > > > > Raghvendra
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
