hudi-bot opened a new issue, #15869: URL: https://github.com/apache/hudi/issues/15869
See the linked JIRA for more details.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5997
- Type: Improvement

---

## Comments

**12/Apr/23 14:53 — leobiscassi:**

Hi [~codope], I have some questions about the best way to get the source schema for the {{HoodieIncrSource}}. I know that we need to add something like the following code in [CloudObjectsSelectorCommon.java|https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java]:

{code:java}
DataFrameReader reader;
if (schemaProvider instanceof FilebasedSchemaProvider) {
  reader = spark.read().schema(SCHEMA_HERE).format(fileFormat);
} else {
  reader = spark.read().format(fileFormat); // current behavior
}
{code}

My questions are:

* For the file format we use {{HoodieIncrSourceConfig.java}}, but the schema provider configuration comes from the {{--schemaprovider-class}} parameter of {{HoodieDeltaStreamer.java}}. What is the recommended way to get this information? Should I use the {{schemaProvider}} passed to the class constructor, or is another approach preferred?

{code:java}
public S3EventsHoodieIncrSource(
    TypedProperties props,
    JavaSparkContext sparkContext,
    SparkSession sparkSession,
    SchemaProvider schemaProvider) {
  super(props, sparkContext, sparkSession, schemaProvider);
}
{code}

* Is there any documentation on how the {{SchemaProvider}} / {{FilebasedSchemaProvider}} abstraction works? I know they provide the two utility methods {{getSourceSchema()}} and {{getTargetSchema()}}, which return an Avro schema. I suppose that is not enough for the DataFrameReader schema option, since it requires the schema as a DDL string or a struct, right? How can I convert it? I ask because I saw a {{registerAvroSchemas()}} method in {{DeltaSync.java}}, and I'm not sure whether I need to do something similar in this case.

I apologize for asking many beginner-level questions.
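A note on the conversion question above: the spark-avro module ships a {{SchemaConverters}} helper that maps an Avro schema to a Spark SQL type, which can then be cast to the {{StructType}} that {{DataFrameReader.schema(...)}} expects. A minimal, hedged sketch; the {{schemaProvider}}, {{spark}}, {{fileFormat}}, and {{pathToCloudObject}} variables are assumed to be in scope, as in the selector code:

```java
import org.apache.avro.Schema;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.StructType;

// Sketch only: convert the Avro schema returned by the provider into the
// StructType that DataFrameReader.schema(...) expects.
Schema avroSchema = schemaProvider.getSourceSchema();
StructType sparkSchema = (StructType) SchemaConverters.toSqlType(avroSchema).dataType();

Dataset<Row> df = spark.read()
    .schema(sparkSchema)       // enforce the provider's schema
    .format(fileFormat)        // e.g. "parquet" or "json"
    .load(pathToCloudObject);  // hypothetical path variable, for illustration
```

For a record-type Avro schema, {{toSqlType(...).dataType()}} yields a {{StructType}}, so the cast is safe in that case.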
I do not regularly use Java, but I understand where to add the code, and I am eager to gain a deeper understanding of Hudi. This exercise has been quite helpful for me.

---

**13/Apr/23 13:40 — codope:**

That's a good question. Whenever you start the deltastreamer for the incremental source with a {{spark-submit}} command, you can provide {{--schemaprovider-class}} as {{org.apache.hudi.utilities.schema.FilebasedSchemaProvider}} and additionally pass the source schema file that you want to enforce as {{--hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source/schema.avsc}} to the same spark-submit command. So the full command will look something like:

{code:java}
spark-submit \
  --jars "<hudi-utilities-bundle_jar>,<other-jars-that-you-add-in-classpath>" \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer <hudi-utilities-bundle_jar> \
  --table-type COPY_ON_WRITE \
  --source-ordering-field <ordering key from source data> \
  --target-base-path s3://bucket_name/path/for/s3_hudi_table \
  --target-table s3_hudi_table \
  --continuous \
  --min-sync-interval-seconds 10 \
  ...
  ...
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source/schema.avsc \
  --hoodie-conf hoodie.datasource.write.recordkey.field="<record key from source data>" \
  ...
  ...
  --source-class org.apache.hudi.utilities.sources.S3EventsHoodieIncrSource \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://bucket_name/path/for/s3_meta_table \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
{code}

---

**13/Apr/23 14:15 — leobiscassi:**

[~codope] I know the behavior from the user's perspective; I use this feature daily. My question was more about how the {{SchemaProvider}} class behaves in the code. Since {{--schemaprovider-class}} is only supplied to the {{HoodieDeltaStreamer}} class, I'm wondering what the best way would be to pass it to the cloud object selector, since it doesn't seem to have a config class the way the general Hudi properties do. The schema I get from {{getSourceSchema()}} and {{getTargetSchema()}} is an Avro schema, which doesn't seem to work with the Spark reader's schema option either. What would be the best way to get the schema struct to pass to the Spark reader?
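On threading the provider through to the selector: since the source already receives the {{SchemaProvider}} in its constructor, one option is to convert it there and hand the resulting {{StructType}} down to the reader. A hedged sketch; the helper name {{toSparkSchemaOrNull}} is an assumption for illustration, not an existing Hudi API:

```java
import org.apache.avro.Schema;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper: return a Spark schema when a file-based provider is
// configured, or null to fall back to Spark's own schema inference.
static StructType toSparkSchemaOrNull(SchemaProvider schemaProvider) {
  if (schemaProvider instanceof FilebasedSchemaProvider) {
    Schema avroSchema = schemaProvider.getSourceSchema();
    return (StructType) SchemaConverters.toSqlType(avroSchema).dataType();
  }
  return null; // current behavior: let Spark infer the schema
}
```

The caller would then apply {{.schema(...)}} to the {{DataFrameReader}} only when this returns non-null, keeping the existing inference path untouched for other providers.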
