[ 
https://issues.apache.org/jira/browse/HUDI-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711431#comment-17711431
 ] 

Léo Biscassi commented on HUDI-5997:
------------------------------------

Hi [~codope], I have some questions regarding the best way to get the source 
schema for the {{HoodieIncrSource}}. I know that we need to add something 
like the following code in the file 
{{[CloudObjectsSelectorCommon.java|https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java]}}:
{code:java}
// Declare the reader once so it stays in scope after the branch.
DataFrameReader reader = spark.read().format(fileFormat); // current version
if (schemaProvider instanceof FilebasedSchemaProvider) {
    // SCHEMA_HERE is a placeholder for the source schema (see questions below).
    reader = reader.schema(SCHEMA_HERE);
}
{code}
My questions are:
 * The file format is read from {{HoodieIncrSourceConfig.java}}, but the 
schema provider is configured through the {{--schemaprovider-class}} option of 
{{HoodieDeltaStreamer.java}}. What would be the recommended way to get this 
info? Should I use the {{schemaProvider}} passed to the class constructor, or 
is another way recommended?

{code:java}
  public S3EventsHoodieIncrSource(
      TypedProperties props,
      JavaSparkContext sparkContext,
      SparkSession sparkSession,
      SchemaProvider schemaProvider) {
    super(props, sparkContext, sparkSession, schemaProvider);
  }
{code}

 * Is there any documentation on how the {{SchemaProvider}} / 
{{FilebasedSchemaProvider}} abstraction works? I know they provide two utility 
methods, {{getSourceSchema()}} and {{getTargetSchema()}}, which return an Avro 
schema. I suppose that's not enough for the {{DataFrameReader}} schema 
option, since it requires the schema as a DDL string or a {{StructType}}, 
right? How can I convert it? I am asking because I saw a method 
{{registerAvroSchemas()}} in {{DeltaSync.java}}, and I'm not sure whether I 
need to do something similar in this case.
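
To make the conversion question concrete, here is a minimal sketch of the idea, not Hudi or Spark code. It assumes a flat record whose fields are Avro primitives, represented here as a plain map of field name to Avro type name, and builds the kind of DDL string that {{DataFrameReader.schema(String)}} accepts. The class {{AvroToDdlSketch}} and its methods are hypothetical names used only for illustration; a real change would presumably rely on an existing converter (for example spark-avro's {{SchemaConverters.toSqlType}}) rather than a hand-written mapping, and would need to handle unions, nested records, and logical types, which this sketch ignores.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of Hudi or Spark.
class AvroToDdlSketch {

    // Maps an Avro primitive type name to the corresponding Spark SQL DDL type.
    static String toSparkType(String avroType) {
        switch (avroType) {
            case "boolean": return "BOOLEAN";
            case "int":     return "INT";
            case "long":    return "BIGINT";
            case "float":   return "FLOAT";
            case "double":  return "DOUBLE";
            case "bytes":   return "BINARY";
            case "string":  return "STRING";
            default:
                // Unions, records, arrays, maps and logical types are out of scope here.
                throw new IllegalArgumentException("Unsupported Avro type: " + avroType);
        }
    }

    // Builds a DDL string such as "id BIGINT, name STRING" from ordered fields.
    static String toDdl(Map<String, String> fields) {
        StringJoiner ddl = new StringJoiner(", ");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            ddl.add(field.getKey() + " " + toSparkType(field.getValue()));
        }
        return ddl.toString();
    }

    public static void main(String[] args) {
        // Assumed example fields; LinkedHashMap keeps the declaration order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "long");
        fields.put("name", "string");
        fields.put("price", "double");
        System.out.println(toDdl(fields)); // prints: id BIGINT, name STRING, price DOUBLE
    }
}
```

The resulting string would take the place of {{SCHEMA_HERE}} in the snippet above, under the assumption that the source schema is flat and contains only primitive fields.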

I apologize for asking many beginner-level questions. I do not regularly use 
Java, but I understand where to add the code, and I am eager to gain a deeper 
understanding of Hudi. This exercise has been quite helpful for me.

> Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource
> --------------------------------------------------------------
>
>                 Key: HUDI-5997
>                 URL: https://issues.apache.org/jira/browse/HUDI-5997
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Sagar Sumit
>            Assignee: Léo Biscassi
>            Priority: Major
>             Fix For: 0.14.0
>
>
> See for more details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
