hudi-agent commented on code in PR #18385:
URL: https://github.com/apache/hudi/pull/18385#discussion_r3191119558
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java:
##########
@@ -546,6 +547,34 @@ private static Option<String> getPropVal(TypedProperties props, ConfigProperty<S
return Option.empty();
}
+ /**
+ * Enables Spark {@code mergeSchema} for cloud object batches of Parquet or ORC files when configured, so
+ * heterogeneous files in one sync round share a merged struct type. Applied before user
+ * {@link CloudSourceConfig#SPARK_DATASOURCE_OPTIONS} so explicit reader options can override.
+ *
+ * <p>Spark's native Parquet reader honors {@code mergeSchema} on all supported versions. Spark's native ORC
+ * reader honors it on Spark 3.0+ (the native ORC impl is the default since Spark 2.4); on older runtimes the
+ * option is silently ignored, which is harmless.
+ */
+ private DataFrameReader applyMergeSchemaOption(DataFrameReader reader, String fileFormat) {
+ if (!isParquetOrOrcFileFormat(fileFormat)) {
+ return reader;
+ }
+ if (!getBooleanWithAltKeys(properties, CLOUD_INCREMENTAL_MERGE_SCHEMA)) {
+ return reader;
+ }
+ return reader.option("mergeSchema", "true");
+ }
+
+ // Package-private for unit testing — see TestCloudObjectsSelectorCommon.
+ static boolean isParquetOrOrcFileFormat(String fileFormat) {
+ if (fileFormat == null) {
+ return false;
+ }
+ String f = fileFormat.trim();
Review Comment:
🤖 nit: the single-character name `f` doesn't communicate intent here — could
you rename it to `trimmed` (or just inline `fileFormat.trim()` in the return
expression) so it's immediately clear what the variable represents?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
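For illustration, a minimal sketch of what the rename could look like. The quoted diff cuts off before the return expression, so the case-insensitive comparison against `parquet` and `orc` is an assumption, and `FileFormatCheck` is a hypothetical wrapper class used only to keep the snippet self-contained:

```java
// Hypothetical stand-in class; the real helper lives in CloudObjectsSelectorCommon.
class FileFormatCheck {

  // Same null guard as the quoted diff, with the local renamed from `f` to `trimmed`.
  static boolean isParquetOrOrcFileFormat(String fileFormat) {
    if (fileFormat == null) {
      return false;
    }
    String trimmed = fileFormat.trim();
    // Assumption: the original return expression is not visible in the diff;
    // a case-insensitive match on the two format names is used here instead.
    return "parquet".equalsIgnoreCase(trimmed) || "orc".equalsIgnoreCase(trimmed);
  }

  public static void main(String[] args) {
    System.out.println(isParquetOrOrcFileFormat(" Parquet ")); // true
    System.out.println(isParquetOrOrcFileFormat("json"));      // false
    System.out.println(isParquetOrOrcFileFormat(null));        // false
  }
}
```

Inlining `fileFormat.trim()` into the return expression would work as well, though it would call `trim()` once per comparison, so the named local is probably the cleaner of the two options.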