leobiscassi opened a new issue, #8211: URL: https://github.com/apache/hudi/issues/8211
**Describe the problem you faced**

I am running a delta streamer job to ingest JSON files from S3 using the `S3EventsHoodieIncrSource`. In this use case, I need to enforce a schema on the source files, because some fields may or may not be present in a given file. According to the docs, I can do this using the `hoodie.deltastreamer.schemaprovider.source.schema.file` parameter, but it doesn't seem to be working.

Although the documentation states that **"For sources that return Dataset<Row>, the schema is obtained implicitly. However, this CLI option allows overriding the schema provider returned by Source"**, this does not seem to apply to this particular source. Looking at [this piece of code](https://github.com/apache/hudi/blob/178767948e906f673d6d4a357c65c11bc574f619/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java#L133), it appears that the provided schema is never passed to the reader, so Spark infers the schema from the files instead:

```java
String fileFormat = props.getString(SOURCE_FILE_FORMAT, DEFAULT_SOURCE_FILE_FORMAT);
Option<Dataset<Row>> dataset = Option.empty();
if (!cloudFiles.isEmpty()) {
  dataset = Option.of(sparkSession.read().format(fileFormat).load(cloudFiles.toArray(new String[0])));
}
return Pair.of(dataset, instantEndpts.getRight());
```

If I provide a source schema using the parameter `hoodie.deltastreamer.schemaprovider.source.schema.file`, I expect that schema to be enforced on all the files read by the job.

Is it appropriate to consider this a bug? Should I file a bug ticket on Jira?

P.S.: If my assumptions and analysis are right, I'd be interested in submitting a fix for this, since it is affecting my workloads 😄

**Environment Description**

This happens in all Hudi versions that I tested (>= 0.9). I have jobs running with 0.9 and 0.11 on EMR.
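To make the expectation above concrete, here is a minimal, Spark-free sketch of what "enforcing the schema" would mean for a single record: every declared field appears in the output, a field missing from the input becomes `null`, and an undeclared field is dropped. As far as I understand, this matches how Spark's JSON reader treats missing and extra fields when a schema is supplied explicitly. The class `SchemaEnforcementSketch` and the method `enforce` are hypothetical names used only for illustration; they are not part of Hudi or Spark.

```java
import java.util.*;

public class SchemaEnforcementSketch {

    // Hypothetical helper: projects one record onto the declared field names.
    // Fields absent from the record map to null; undeclared fields are dropped.
    static Map<String, Object> enforce(List<String> schemaFields, Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>(); // keeps the schema's field order
        for (String field : schemaFields) {
            out.put(field, record.get(field)); // get() returns null when the key is absent
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "name", "email");

        // A source record that lacks "email" and carries an undeclared "debug" field.
        Map<String, Object> partial = new HashMap<>();
        partial.put("id", 1);
        partial.put("name", "a");
        partial.put("debug", true);

        System.out.println(enforce(schema, partial)); // prints {id=1, name=a, email=null}
    }
}
```

Without an explicit schema, the columns of the resulting dataset depend on whichever fields happen to appear in the batch of files being read, which is the behavior described in this issue.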
