voonhous commented on code in PR #7480:
URL: https://github.com/apache/hudi/pull/7480#discussion_r1053930850
##########
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark32PlusHoodieParquetFileFormat.scala:
##########
@@ -228,7 +228,24 @@ class Spark32PlusHoodieParquetFileFormat(private val shouldAppendPartitionValues
SparkInternalSchemaConverter.collectTypeChangedCols(querySchemaOption.get(), mergedInternalSchema)
} else {
- new java.util.HashMap()
+ val implicitTypeChangeInfo: java.util.Map[Integer, Pair[DataType, DataType]] = new java.util.HashMap()
Review Comment:
> > @voonhous Maybe we need a parameter to control this feature, not all tables need to follow this logic
>
> Hmmm, CMIIW, Hudi has been relying on ASR for schema resolution since `hudi-0.7`. As such, I was under the impression that this should be a default behaviour.
>
> Nonetheless, a configuration key can be introduced wherein this behaviour is enabled by default.
>
> However, validation will need to be performed such that the choice between ASR/HFSE is mutually exclusive. i.e. if ASR is enabled, HFSE should be disabled and vice-versa. WDYT?
@xiarixiaoyao I looked at the code and realised that there is no way to validate configuration values based on other configuration values.
I wanted to add an `AVRO_SCHEMA_RESOLUTION_ENABLE` configuration key with the description:
```text
Enable support for schema evolution using Avro's Schema Resolution (ASR).
This configuration is mutually exclusive with Hudi's Full/Comprehensive Schema Evolution (HFSE) feature via the configuration key (hoodie.schema.on.read.enable).
The choice between ASR/HFSE is mutually exclusive. i.e. if ASR is enabled, HFSE should be disabled and vice-versa.
HFSE will take precedence over ASR. i.e. Enabling both HFSE and ASR will cause Hudi to default to HFSE for schema evolution.
```
Given that this is the intended behaviour, and given the lack of configuration validation, I see no benefit in introducing `AVRO_SCHEMA_RESOLUTION_ENABLE`.
Since `SCHEMA_EVOLUTION_ENABLE` will take precedence over `AVRO_SCHEMA_RESOLUTION_ENABLE`, I think we can rely on the former (`SCHEMA_EVOLUTION_ENABLE`) to determine whether ASR should be used.
If `SCHEMA_EVOLUTION_ENABLE` is enabled, use HFSE; otherwise, fall back to ASR.
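
To make the proposal concrete, here is a rough sketch of the selection logic I have in mind. It is not code from this PR: `SchemaEvolutionMode`, `SchemaEvolutionModeResolver` and `resolve` are hypothetical names, the options are modelled as a plain `Map[String, String]`, and I am assuming that a missing `hoodie.schema.on.read.enable` means disabled.

```scala
// Hypothetical sketch; none of these types exist in Hudi.
sealed trait SchemaEvolutionMode
case object HFSE extends SchemaEvolutionMode // Hudi Full/Comprehensive Schema Evolution
case object ASR extends SchemaEvolutionMode  // Avro Schema Resolution

object SchemaEvolutionModeResolver {
  // Configuration key named in the proposed description above.
  val SchemaOnReadEnableKey = "hoodie.schema.on.read.enable"

  // A single flag decides the mode, so no cross-key validation is needed.
  // Assumption: an absent or unparseable value is treated as "false".
  def resolve(options: Map[String, String]): SchemaEvolutionMode = {
    val hfseEnabled =
      options.get(SchemaOnReadEnableKey).exists(_.trim.equalsIgnoreCase("true"))
    if (hfseEnabled) HFSE else ASR
  }
}
```

Because HFSE always wins when it is enabled, a second flag for ASR would never change the result of `resolve`, which is why I think it is redundant.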
WDYT?