leobiscassi opened a new issue, #8211:
URL: https://github.com/apache/hudi/issues/8211

   **Describe the problem you faced**
   
   I am running a DeltaStreamer job to ingest JSON files from S3 using the 
`S3EventsHoodieIncrSource`. In this use case, I need to enforce a schema on 
the source files because some fields are not present in every file. According 
to the docs, I can do this with the 
`hoodie.deltastreamer.schemaprovider.source.schema.file` parameter, but it 
doesn't seem to be working.
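   
   For reference, a minimal sketch of the setup described above. The S3 path 
is a hypothetical placeholder, and the fragment assumes the job is launched 
with `--schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider`, which is the 
provider that reads this property:
   
   ```properties
   # Assumption: FilebasedSchemaProvider is the configured schema provider.
   # Avro schema (.avsc) the source JSON files are expected to conform to.
   # The bucket and key below are placeholders, not real paths.
   hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-bucket/schemas/source.avsc
   ```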
   
   Although the documentation states that **"For sources that return 
Dataset<Row>, the schema is obtained implicitly. However, this CLI option 
allows overriding the schema provider returned by Source"**, this does not seem 
to apply to `S3EventsHoodieIncrSource`. Looking at [this piece of 
code](https://github.com/apache/hudi/blob/178767948e906f673d6d4a357c65c11bc574f619/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java#L133),
 it appears that the provided schema is never applied when the files are read.
   
   ```java
       String fileFormat = props.getString(SOURCE_FILE_FORMAT, DEFAULT_SOURCE_FILE_FORMAT);
       Option<Dataset<Row>> dataset = Option.empty();
       if (!cloudFiles.isEmpty()) {
         dataset = Option.of(sparkSession.read().format(fileFormat).load(cloudFiles.toArray(new String[0])));
       }
       return Pair.of(dataset, instantEndpts.getRight());
   ```
   
   If I provide a source schema through the 
`hoodie.deltastreamer.schemaprovider.source.schema.file` parameter, I expect 
that schema to be enforced on every file read by the job. Is it appropriate 
to consider this a bug? Should I file a bug ticket on Jira?
   
   P.S.: If my assumptions and analysis are right, I'd be interested in 
submitting a fix for this, since it is affecting my workloads 😄 
   
   **Environment Description**
   
   This happens in every Hudi version I have tested (>= 0.9). I have jobs 
running with 0.9 and 0.11 on EMR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
