cbomgit opened a new issue, #9498:
URL: https://github.com/apache/hudi/issues/9498
Hello,
I'm using Hudi 0.11 on Spark 3.2.1 on EMR 6.7.0. I have an ingestion pipeline
where data is written out in daily batches and each batch is its own Hudi
table. In other words, data is structured like so:
```
basePath/
2023-08-10/ /* this is a HUDI table */
partitionColumn1/
partitionColumn2/
2023-08-11/ /* this is another HUDI table */
partitionColumn1/
partitionColumn2/
...
```
The schema is common across all tables. We recently added a column to this
schema and were trying to figure out the best way of handling this across our
downstream data processing jobs, which typically read a date range of the
tables shown above.
I found a solution utilizing the `mergeSchema` option. If I read all tables
in like so, then my data is read in with the correct updated schema:
```
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

// Snapshot-query read options; s3ReadPath is a comma-separated list of paths.
def getHudiReadOptions(s3ReadPath: String): Map[String, String] = Map(
  DataSourceReadOptions.QUERY_TYPE.key() ->
    DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL,
  DataSourceReadOptions.READ_PATHS.key() -> s3ReadPath,
  HoodieMetadataConfig.ENABLE.key() -> "true"
)

val paths = List(
  "s3://daily-data/2023-08-14/*/*/*", /* this path has the old schema */
  "s3://daily-data/2023-08-15/*/*/*"  /* this path has the new field */
)

// Read both daily tables in one load, asking Spark to merge their schemas.
val merged = spark.read.format("org.apache.hudi")
  .options(getHudiReadOptions(paths.mkString(",")))
  .option("mergeSchema", "true")
  .load()
```
If I do not include the `mergeSchema` option, then the data is read in but
is missing the new field. My question is: is this expected behavior? Can we
rely on the `mergeSchema` option to handle these kinds of schema differences? I
have read through the schema evolution documentation listed here
(https://hudi.apache.org/docs/0.11.0/schema_evolution) but have not seen a
mention of using this option. Our use case, with a separate Hudi table for
each day of data, is not typical, so I wanted to check that this behavior is
reliable.
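For comparison, one alternative that does not depend on `mergeSchema` would be to load each daily table on its own and combine the results with Spark's `unionByName(..., allowMissingColumns = true)`, available since Spark 3.1. The following is only a sketch under assumptions: the two small in-memory DataFrames and the column names `id`, `value`, and `newField` are hypothetical stand-ins for the per-day Hudi reads, and whether this is preferable to `mergeSchema` for Hudi tables is exactly the open question here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameSketch {
  // Union DataFrames by column name; a column missing from one input is
  // filled with nulls for that input's rows (requires Spark 3.1 or later).
  def unionAll(frames: Seq[DataFrame]): DataFrame =
    frames.reduce((a, b) => a.unionByName(b, allowMissingColumns = true))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-by-name-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for two daily tables. In the real pipeline each
    // would come from its own spark.read.format("org.apache.hudi") load.
    val oldDay = Seq((1, "a")).toDF("id", "value")                  // old schema
    val newDay = Seq((2, "b", "x")).toDF("id", "value", "newField") // new column

    val merged = unionAll(Seq(oldDay, newDay))
    merged.printSchema()

    spark.stop()
  }
}
```

With this approach the resulting schema is the union of all input columns, and rows from tables written before the column was added carry null in the new field.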
**Environment Description**
* Hudi version : 0.11
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No