[
https://issues.apache.org/jira/browse/HUDI-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711431#comment-17711431
]
Léo Biscassi commented on HUDI-5997:
------------------------------------
Hi [~codope], I have some questions about the best way to get the source
schema for the {{HoodieIncrSource}}. As I understand it, we need to add
something like the following code to
{{[CloudObjectsSelectorCommon.java|https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java]}}:
{code:java}
// Declare the reader once so it remains in scope after the branch.
DataFrameReader reader = spark.read().format(fileFormat); // current version
if (schemaProvider instanceof FilebasedSchemaProvider) {
  // SCHEMA_HERE is a placeholder for the source schema to apply.
  reader = reader.schema(SCHEMA_HERE);
}
{code}
My questions are:
* We get the file format from {{HoodieIncrSourceConfig.java}}, but the schema
provider is configured through the {{--schemaprovider-class}} option of
{{HoodieDeltaStreamer.java}}. What is the recommended way to get this
information? Should I use the {{schemaProvider}} passed to the class
constructor, or is another approach preferred?
{code:java}
public S3EventsHoodieIncrSource(
    TypedProperties props,
    JavaSparkContext sparkContext,
    SparkSession sparkSession,
    SchemaProvider schemaProvider) {
  super(props, sparkContext, sparkSession, schemaProvider);
}
{code}
* Is there any documentation on how the {{SchemaProvider}} /
{{FilebasedSchemaProvider}} abstraction works? I know they provide two utility
methods, {{getSourceSchema()}} and {{getTargetSchema()}}, which return an Avro
schema. I suppose that is not enough for the {{DataFrameReader}} schema
option, since it requires the schema as a DDL string or a struct, right? How
can I convert it? I ask because I saw a {{registerAvroSchemas()}} method in
{{DeltaSync.java}}, and I am not sure whether I need to do something similar
here.
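To make the conversion question concrete, here is a minimal sketch of the kind of mapping I have in mind, written with the JDK only. It assumes a flat record whose fields are Avro primitives, and it builds the DDL string form that the reader's schema option accepts. The class {{AvroToDdlSketch}} and its methods are hypothetical names, not part of Hudi or Spark. A real implementation would presumably rely on an existing converter rather than a hand-written mapping; Hudi's {{AvroConversionUtils.convertAvroSchemaToStructType}} looks like a candidate, but that is an assumption to verify against the codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of Hudi or Spark.
final class AvroToDdlSketch {

    // Maps an Avro primitive type name to the Spark SQL type name used in DDL strings.
    // Nested records, unions, arrays, maps and logical types are out of scope here.
    static String toSparkType(String avroType) {
        switch (avroType) {
            case "boolean": return "BOOLEAN";
            case "int":     return "INT";
            case "long":    return "BIGINT";
            case "float":   return "FLOAT";
            case "double":  return "DOUBLE";
            case "bytes":   return "BINARY";
            case "string":  return "STRING";
            default:
                throw new IllegalArgumentException(
                    "Unsupported Avro type in this sketch: " + avroType);
        }
    }

    // Builds a DDL string such as "id BIGINT, name STRING" from
    // (field name -> Avro primitive type) pairs, preserving field order.
    static String toDdl(Map<String, String> avroFields) {
        StringJoiner ddl = new StringJoiner(", ");
        for (Map.Entry<String, String> field : avroFields.entrySet()) {
            ddl.add(field.getKey() + " " + toSparkType(field.getValue()));
        }
        return ddl.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the fields of the Avro schema returned by getSourceSchema().
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "long");
        fields.put("name", "string");
        fields.put("price", "double");
        System.out.println(toDdl(fields));
    }
}
```

The resulting string would take the place of {{SCHEMA_HERE}} in the first snippet. The sketch throws on any type it does not know, so an unsupported schema fails loudly instead of being read with a wrong type.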
I apologize for asking many beginner-level questions. I do not regularly use
Java, but I understand where to add the code, and I am eager to gain a deeper
understanding of Hudi. This exercise has been quite helpful for me.
> Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource
> --------------------------------------------------------------
>
> Key: HUDI-5997
> URL: https://issues.apache.org/jira/browse/HUDI-5997
> Project: Apache Hudi
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: Sagar Sumit
> Assignee: Léo Biscassi
> Priority: Major
> Fix For: 0.14.0
>
>
> See for more details