remeajayi2022 opened a new issue, #12301:
URL: https://github.com/apache/hudi/issues/12301
I’m trying to ingest from a ProtoKafka source using Hudi Streamer, but am
encountering an issue.
```
Exception in thread "main"
org.apache.hudi.utilities.ingestion.HoodieIngestionException: Ingestion service
was shut down with exception.
at ...
Error reading source schema from registry. Please check
hoodie.streamer.schemaprovider.registry.url is configured correctly. Truncated
URL: https://....ons/latest
at
org.apache.hudi.utilities.schema.SchemaRegistryProvider.parseSchemaFromRegistry(SchemaRegistryProvider.java:111)
at
org.apache.hudi.utilities.schema.SchemaRegistryProvider.getSourceSchema(SchemaRegistryProvider.java:204)
... 10 more
...
Caused by: org.apache.hudi.internal.schema.HoodieSchemaException: Failed to
parse schema from registry: syntax = "proto3";
package datagen;
...
Caused by: java.lang.NoSuchMethodException:
org.apache.hudi.utilities.schema.converter.ProtoSchemaToAvroSchemaConverter.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 13 more
```
The top-level error message points to a misconfigured schema registry URL.
However, the same URL works for Hudi Streamer jobs ingesting from AvroKafka
sources, and when I query the schema registry URL with curl it correctly
returns the schema.
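For reference, this is roughly the curl check I ran against the Confluent Schema Registry REST API; the host, topic, and credential placeholders match the ones used in the submit command below:

```
# Placeholders: substitute your registry host, topic name, and API key/secret.
# Fetches the latest registered schema for the topic's value subject.
curl -s -u "<schema-registry-key>:<schema-registry-secret>" \
  "https://<schema-registry-url>/subjects/<topic-name>-value/versions/latest"
# A healthy registry responds with JSON containing "subject", "version",
# "schemaType": "PROTOBUF", and the schema text itself.
```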
Environment Details
- Hudi version: v0.15.0
- Spark version: 3.1.3
- Scala version: 2.12
- Google Dataproc version: 2.0.125-debian10
Spark Submit Command and Protobuf Configuration
```
gcloud dataproc jobs submit spark --cluster <GCP-CLUSTER> \
  --region us-central1 \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --project <GCP-PROJECT> \
  --jars <jar-base-url>/jars/hudi-gcp-bundle-0.15.0.jar,<jar-base-url>/jars/spark-avro_2.12-3.1.1.jar,<jar-base-url>/jars/hudi-utilities-bundle-raw_2.12-0.15.0.jar,<jar-base-url>/jars/kafka-protobuf-provider-5.5.0.jar \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-class org.apache.hudi.utilities.sources.ProtoKafkaSource \
  --hoodie-conf sasl.jaas.config="org.apache.kafka.common.security.plain.PlainLoginModule required username='<username>' password='<password>';" \
  --hoodie-conf hoodie.streamer.schemaprovider.proto.class.name=<topic-name> \
  --hoodie-conf basic.auth.credentials.source=USER_INFO \
  --hoodie-conf schema.registry.basic.auth.user.info=<schema-registry-key>:<schema-registry-secret> \
  --hoodie-conf hoodie.streamer.schemaprovider.registry.url=https://<schema-registry-key>:<schema-registry-secret>@<schema-registry-url>/subjects/<topic-name>-value/versions/latest \
  --hoodie-conf hoodie.streamer.source.kafka.topic=<topic-name> \
  --hoodie-conf hoodie.streamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.protobuf.KafkaProtobufDeserializer \
  --hoodie-conf hoodie.streamer.schemaprovider.registry.schemaconverter=org.apache.hudi.utilities.schema.converter.ProtoSchemaToAvroSchemaConverter
```
Additional Context
1. I've verified the Protobuf schema is valid; it is a sample proto schema
from Confluent’s Datagen connector.
2. I've confirmed the schema registry URL is configured correctly; it works
fine with a similar `AvroKafka` Spark job.
3. I added `hoodie.streamer.schemaprovider.proto.class.name` and
`hoodie.streamer.source.kafka.proto.value.deserializer.class=org.apache.kafka.common.serialization.ByteArrayDeserializer`.
I don't think these are required, and their presence or absence did not
affect the error.
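One more observation: the innermost `Caused by` is a `NoSuchMethodException` on `ProtoSchemaToAvroSchemaConverter.<init>()`, which suggests the converter is being instantiated reflectively through a no-arg constructor it does not have, rather than the registry URL actually being wrong. A minimal, self-contained sketch of that failure mode (the `OneArgOnly` class below is hypothetical, standing in for the converter):

```java
public class ReflectDemo {
    // Stand-in for a converter that only exposes a parameterized constructor.
    static class OneArgOnly {
        public OneArgOnly(String config) {}
    }

    // Mirrors the reflective lookup in the stack trace: Class.getConstructor()
    // with no arguments throws NoSuchMethodException when the class has no
    // public no-arg constructor.
    static boolean hasNoArgConstructor(Class<?> cls) {
        try {
            cls.getConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNoArgConstructor(OneArgOnly.class)); // prints false
        System.out.println(hasNoArgConstructor(String.class));     // prints true
    }
}
```

If the reflective lookup in `SchemaRegistryProvider` works like this, any converter configured via `hoodie.streamer.schemaprovider.registry.schemaconverter` would need a public no-arg constructor to load.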
Steps to Reproduce
1. Build a Hudi 0.15.0 JAR with Spark 3.1 and Scala 2.12.
2. Use a Protobuf schema on an accessible schema registry, preferably an
authenticated one.
3. Configure a Hudi Streamer job with the spark-submit command above.
4. Run the Spark job.
I’d appreciate any insights into resolving this issue.
Is there an alternative or a workaround for configuring the Protobuf schema?
Am I missing any configuration settings?
Thank you for your help!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]