cbomgit opened a new issue, #9498:
URL: https://github.com/apache/hudi/issues/9498

   Hello,
   
   I'm using Hudi 0.11 on Spark 3.2.1 on EMR 6.7.0. I have an ingestion pipeline 
where data is written out in daily batches and each batch is its own Hudi 
table. In other words, data is structured like so:
   
   ```
   basePath/
      2023-08-10/    /* this is a HUDI table */
           partitionColumn1/
                 partitionColumn2/
      2023-08-11/    /* this is another HUDI table */
           partitionColumn1/
                 partitionColumn2/
      ...
   ```
   
   The schema is common across all tables. We recently added a column to this 
schema and were trying to figure out the best way of handling this across our 
downstream data processing jobs, which typically read in a date range of 
data from the tables above.
   
   I found a solution utilizing the `mergeSchema` option. If I read all tables 
in like so, then my data is read in with the correct updated schema:
   
   ```scala
   def getHudiReadOptions(s3ReadPath: String): Map[String, String] = Map(
     DataSourceReadOptions.QUERY_TYPE.key() -> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL,
     DataSourceReadOptions.READ_PATHS.key() -> s3ReadPath,
     HoodieMetadataConfig.ENABLE.key() -> "true"
   )

   val paths = List(
     "s3://daily-data/2023-08-14/*/*/*",  /* this path has the old schema */
     "s3://daily-data/2023-08-15/*/*/*"   /* this path has the new field */
   )

   val merged = spark.read.format("org.apache.hudi")
     .options(getHudiReadOptions(paths.mkString(",")))
     .option("mergeSchema", "true")
     .load()
   ```
   
   If I do not include the `mergeSchema` option, then the data is read in but 
is missing the new field. My question is: is this expected behavior? Can we 
rely on the mergeSchema option to handle these kinds of schema differences? I 
have read through the schema evolution documentation listed here 
(https://hudi.apache.org/docs/0.11.0/schema_evolution) but have not seen a 
mention of using this option. Our use case is atypical, with multiple Hudi 
tables (one per day of data), so I wanted to confirm that this behavior is 
reliable.
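   For reference, what I understand `mergeSchema` to do is the same as Spark's 
Parquet schema merging: the resulting schema is the union of the fields across 
all read paths, and rows coming from files that lack a field are filled with 
null. A minimal, Spark-free Python sketch of that union behavior (field names 
here are made up for illustration, not from our actual tables):

```python
# Illustrative sketch of "mergeSchema"-style semantics: the merged schema is
# the union of per-batch schemas, and records missing a field are padded with
# None. Field names below are hypothetical.

def merge_schemas(schemas):
    """Union of field names across batches, preserving first-seen order."""
    merged = []
    for schema in schemas:
        for field in schema:
            if field not in merged:
                merged.append(field)
    return merged

def read_with_merged_schema(batches):
    """Pad each record to the merged schema, filling missing fields with None."""
    schema = merge_schemas([list(records[0]) for records in batches if records])
    rows = []
    for records in batches:
        for record in records:
            rows.append({f: record.get(f) for f in schema})
    return schema, rows

# One batch with the old schema and one batch with an added column:
old_batch = [{"id": 1, "value": "a"}]
new_batch = [{"id": 2, "value": "b", "new_col": "x"}]

schema, rows = read_with_merged_schema([old_batch, new_batch])
# schema includes "new_col"; the old-batch row gets None for it
```

That matches what we observe: with `mergeSchema` the old-schema rows come back 
with nulls in the new column, and without it the new column is dropped entirely.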
   
   **Environment Description**
   
   * Hudi version : 0.11
   
   * Spark version : 3.2.1
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
