hudi-bot opened a new issue, #15869: URL: https://github.com/apache/hudi/issues/15869
See the linked JIRA for more details.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5997
- Type: Improvement

---

## Comments

**12/Apr/23 14:53 — leobiscassi:**

Hi [~codope], I have some questions about the best way to get the source schema for the {{HoodieIncrSource}}. I know that we need to add something like the following code in [CloudObjectsSelectorCommon.java|https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java]:

{code:java}
DataFrameReader reader;
if (schemaProvider instanceof FilebasedSchemaProvider) {
  reader = spark.read().schema(SCHEMA_HERE).format(fileFormat);
} else {
  reader = spark.read().format(fileFormat); // current behavior
}
{code}

My questions are:

* For the file format we use {{HoodieIncrSourceConfig.java}}, but the schema provider configuration comes from the {{--schemaprovider-class}} parameter of {{HoodieDeltaStreamer.java}}. What is the recommended way to get this information? Should I use the {{schemaProvider}} passed to the class constructor, or is another approach preferred?

{code:java}
public S3EventsHoodieIncrSource(
    TypedProperties props,
    JavaSparkContext sparkContext,
    SparkSession sparkSession,
    SchemaProvider schemaProvider) {
  super(props, sparkContext, sparkSession, schemaProvider);
}
{code}

* Is there any documentation on how the {{SchemaProvider}} / {{FilebasedSchemaProvider}} abstraction works? I know they provide the two utility methods {{getSourceSchema()}} and {{getTargetSchema()}}, which return an Avro schema. I suppose that is not enough for the DataFrameReader schema option, since it requires the schema as a DDL string or a struct, right? How can I convert it? I ask because I saw a {{registerAvroSchemas()}} method in {{DeltaSync.java}}, and I'm not sure whether I need to do something similar in this case.

I apologize for asking many beginner-level questions.
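A note on the conversion question above: the spark-avro module ships a {{SchemaConverters}} helper that maps an Avro schema to a Spark SQL type, which can then be cast to the {{StructType}} that {{DataFrameReader.schema(...)}} expects. A minimal, hedged sketch; the {{schemaProvider}}, {{spark}}, {{fileFormat}}, and {{pathToCloudObject}} variables are assumed to be in scope, as in the selector code:

```java
import org.apache.avro.Schema;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.StructType;

// Sketch only: convert the Avro schema returned by the provider into the
// StructType that DataFrameReader.schema(...) expects.
Schema avroSchema = schemaProvider.getSourceSchema();
StructType sparkSchema = (StructType) SchemaConverters.toSqlType(avroSchema).dataType();

Dataset<Row> df = spark.read()
    .schema(sparkSchema)       // enforce the provider's schema
    .format(fileFormat)        // e.g. "parquet" or "json"
    .load(pathToCloudObject);  // hypothetical path variable, for illustration
```

For a record-type Avro schema, {{toSqlType(...).dataType()}} yields a {{StructType}}, so the cast is safe in that case.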
I do not regularly use Java, but I understand where to add the code, and I am eager to gain a deeper understanding of Hudi. This exercise has been quite helpful for me.

---

**13/Apr/23 13:40 — codope:**

That's a good question. Whenever you start the deltastreamer for the incremental source with a {{spark-submit}} command, you can provide {{--schemaprovider-class}} as {{org.apache.hudi.utilities.schema.FilebasedSchemaProvider}} and additionally pass the source schema file that you want to enforce as {{--hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source/schema.avsc}} to the same spark-submit command. So the full command will look something like:

{code:java}
spark-submit \
  --jars "<hudi-utilities-bundle_jar>,<other-jars-that-you-add-in-classpath>" \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer <hudi-utilities-bundle_jar> \
  --table-type COPY_ON_WRITE \
  --source-ordering-field <ordering key from source data> \
  --target-base-path s3://bucket_name/path/for/s3_hudi_table \
  --target-table s3_hudi_table \
  --continuous \
  --min-sync-interval-seconds 10 \
  ...
  ...
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source/schema.avsc \
  --hoodie-conf hoodie.datasource.write.recordkey.field="<record key from source data>" \
  ...
  ...
  --source-class org.apache.hudi.utilities.sources.S3EventsHoodieIncrSource \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://bucket_name/path/for/s3_meta_table \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
{code}

---

**13/Apr/23 14:15 — leobiscassi:**

[~codope] I know the behavior from the user's perspective; I use this feature daily. My question was more about how the {{SchemaProvider}} class behaves in the code. Since {{--schemaprovider-class}} is only supplied to the {{HoodieDeltaStreamer}} class, I'm wondering what the best way would be to pass it to the cloud object selector, since it doesn't seem to have a config class the way the general Hudi properties do. The schema I get from {{getSourceSchema()}} and {{getTargetSchema()}} is an Avro schema, which doesn't seem to work with the Spark reader's schema option either. What would be the best way to get the schema struct to pass to the Spark reader?
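On threading the provider through to the selector: since the source already receives the {{SchemaProvider}} in its constructor, one option is to convert it there and hand the resulting {{StructType}} down to the reader. A hedged sketch; the helper name {{toSparkSchemaOrNull}} is an assumption for illustration, not an existing Hudi API:

```java
import org.apache.avro.Schema;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper: return a Spark schema when a file-based provider is
// configured, or null to fall back to Spark's own schema inference.
static StructType toSparkSchemaOrNull(SchemaProvider schemaProvider) {
  if (schemaProvider instanceof FilebasedSchemaProvider) {
    Schema avroSchema = schemaProvider.getSourceSchema();
    return (StructType) SchemaConverters.toSqlType(avroSchema).dataType();
  }
  return null; // current behavior: let Spark infer the schema
}
```

The caller would then apply {{.schema(...)}} to the {{DataFrameReader}} only when this returns non-null, keeping the existing inference path untouched for other providers.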
