Copilot commented on code in PR #1781:
URL: https://github.com/apache/auron/pull/1781#discussion_r2646740850
##########
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java:
##########
@@ -237,6 +237,11 @@ public class SparkAuronConfiguration extends
AuronConfiguration {
.description("use microsecond precision when reading ORC timestamp
columns. ")
.booleanType()
.defaultValue(false);
+ public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE =
ConfigOptions.key(
+ "auron.orc.schema.iscasesensitive")
+ .description("does matching ORC file schema distinguish between
uppercase and lowercase ")
Review Comment:
The description has a grammatical error. It should read "does matching ORC
file schema distinguish between uppercase and lowercase?" with a question mark
at the end, or better yet, be rephrased as a statement: "whether ORC file
schema matching distinguishes between uppercase and lowercase" or "enable
case-sensitive matching for ORC file schema fields".
```suggestion
.description("whether ORC file schema matching distinguishes
between uppercase and lowercase.")
```
##########
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java:
##########
@@ -237,6 +237,11 @@ public class SparkAuronConfiguration extends
AuronConfiguration {
.description("use microsecond precision when reading ORC timestamp
columns. ")
.booleanType()
.defaultValue(false);
+ public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE =
ConfigOptions.key(
+ "auron.orc.schema.iscasesensitive")
Review Comment:
The configuration key uses "iscasesensitive" (all lowercase), which is
inconsistent with the naming convention used in similar ORC configurations. The
key should use dots to separate words for readability, like
"auron.orc.schema.is.case.sensitive" to match the pattern seen in other
configurations such as "auron.orc.force.positional.evolution" and
"auron.orc.timestamp.use.microsecond".
##########
native-engine/datafusion-ext-plans/src/orc_exec.rs:
##########
@@ -372,6 +377,21 @@ impl SchemaAdapter {
projection.push(named_column.data_type().column_index());
}
}
+ } else {
+ for named_column in file_named_columns {
+ // Case-insensitive field name matching
+ let named_column_name_lower =
named_column.name().to_lowercase();
+ if let Some((proj_idx, _)) = self
+ .projected_schema
+ .fields()
+ .iter()
+ .enumerate()
+ .find(|(_, f)| f.name().to_lowercase() ==
named_column_name_lower)
+ {
+ field_mappings[proj_idx] = Some(projection.len());
+ projection.push(named_column.data_type().column_index());
+ }
+ }
Review Comment:
The case-insensitive field matching performs string lowercase conversion on
every iteration (line 383) and then again for each field comparison (line 389).
For files with many columns, this could be inefficient. Consider pre-computing
a HashMap of lowercase field names to their indices for O(1) lookup instead of
O(n*m) where n is the number of file columns and m is the number of projected
fields. Alternatively, at minimum, compute the lowercase versions of
projected_schema field names once before the loop to avoid repeated conversions.
##########
native-engine/auron-jni-bridge/src/conf.rs:
##########
@@ -57,6 +57,7 @@ define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE);
define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE);
define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION);
define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
Review Comment:
The configuration name "ORC_SCHEMA_ISCASE_SENSITIVE" with a default value of
false is semantically confusing. The name suggests that when set to false (the
default), case-insensitive matching is disabled, but the actual behavior is the
opposite: false enables case-insensitive matching. This creates a double
negative that makes the configuration harder to understand. Consider renaming
to "ORC_SCHEMA_CASE_INSENSITIVE" (with default true for Hive compatibility) or
improving the documentation to clearly state that false means "case-insensitive
matching enabled" and true means "case-sensitive matching enabled".
```suggestion
define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
// NOTE: The semantics of this flag are intentionally inverted for Hive
compatibility:
// - Default: false => case-insensitive schema matching is ENABLED
// - true => case-sensitive schema matching is ENABLED
// The name ORC_SCHEMA_ISCASE_SENSITIVE is historical; do not change it
without
// coordinating with the corresponding configuration on the JVM side.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]