Re: [PR] [AURON #1780] Fix the ORC table written by Hive to read null uppercase fields [auron]

via GitHub Thu, 25 Dec 2025 00:42:08 -0800


Copilot commented on code in PR #1781:
URL: https://github.com/apache/auron/pull/1781#discussion_r2646740850



##########
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java:
##########
@@ -237,6 +237,11 @@ public class SparkAuronConfiguration extends 
AuronConfiguration {
             .description("use microsecond precision when reading ORC timestamp 
columns. ")
             .booleanType()
             .defaultValue(false);
+    public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = 
ConfigOptions.key(
+                    "auron.orc.schema.iscasesensitive")
+            .description("does matching ORC file schema distinguish between 
uppercase and lowercase ")

Review Comment:
   The description has a grammatical error. It should read "does matching ORC 
file schema distinguish between uppercase and lowercase?" with a question mark 
at the end, or better yet, be rephrased as a statement: "whether ORC file 
schema matching distinguishes between uppercase and lowercase" or "enable 
case-sensitive matching for ORC file schema fields".
   ```suggestion
               .description("whether ORC file schema matching distinguishes 
between uppercase and lowercase.")
   ```



##########
spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java:
##########
@@ -237,6 +237,11 @@ public class SparkAuronConfiguration extends 
AuronConfiguration {
             .description("use microsecond precision when reading ORC timestamp 
columns. ")
             .booleanType()
             .defaultValue(false);
+    public static final ConfigOption<Boolean> ORC_SCHEMA_ISCASE_SENSITIVE = 
ConfigOptions.key(
+                    "auron.orc.schema.iscasesensitive")

Review Comment:
   The configuration key uses "iscasesensitive" (all lowercase), which is 
inconsistent with the naming convention used in similar ORC configurations. The 
key should use dots to separate words for readability, like 
"auron.orc.schema.is.case.sensitive" to match the pattern seen in other 
configurations such as "auron.orc.force.positional.evolution" and 
"auron.orc.timestamp.use.microsecond".



##########
native-engine/datafusion-ext-plans/src/orc_exec.rs:
##########
@@ -372,6 +377,21 @@ impl SchemaAdapter {
                     projection.push(named_column.data_type().column_index());
                 }
             }
+        } else {
+            for named_column in file_named_columns {
+                // Case-insensitive field name matching
+                let named_column_name_lower = 
named_column.name().to_lowercase();
+                if let Some((proj_idx, _)) = self
+                    .projected_schema
+                    .fields()
+                    .iter()
+                    .enumerate()
+                    .find(|(_, f)| f.name().to_lowercase() == 
named_column_name_lower)
+                {
+                    field_mappings[proj_idx] = Some(projection.len());
+                    projection.push(named_column.data_type().column_index());
+                }
+            }

Review Comment:
   The case-insensitive field matching performs string lowercase conversion on 
every iteration (line 383) and then again for each field comparison (line 389). 
For files with many columns, this could be inefficient. Consider pre-computing 
a HashMap of lowercase field names to their indices for O(1) lookup instead of 
O(n*m) where n is the number of file columns and m is the number of projected 
fields. Alternatively, at minimum, compute the lowercase versions of 
projected_schema field names once before the loop to avoid repeated conversions.



##########
native-engine/auron-jni-bridge/src/conf.rs:
##########
@@ -57,6 +57,7 @@ define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE);
 define_conf!(IntConf, SUGGESTED_BATCH_MEM_SIZE_KWAY_MERGE);
 define_conf!(BooleanConf, ORC_FORCE_POSITIONAL_EVOLUTION);
 define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);

Review Comment:
   The configuration name "ORC_SCHEMA_ISCASE_SENSITIVE" with a default value of 
false is semantically confusing. The name suggests that when set to false (the 
default), case-insensitive matching is disabled, but the actual behavior is the 
opposite: false enables case-insensitive matching. This creates a double 
negative that makes the configuration harder to understand. Consider renaming 
to "ORC_SCHEMA_CASE_INSENSITIVE" (with default true for Hive compatibility) or 
improving the documentation to clearly state that false means "case-insensitive 
matching enabled" and true means "case-sensitive matching enabled".
   ```suggestion
   define_conf!(BooleanConf, ORC_TIMESTAMP_USE_MICROSECOND);
   // NOTE: The semantics of this flag are intentionally inverted for Hive 
compatibility:
   // - Default: false  => case-insensitive schema matching is ENABLED
   // - true            => case-sensitive schema matching is ENABLED
   // The name ORC_SCHEMA_ISCASE_SENSITIVE is historical; do not change it 
without
   // coordinating with the corresponding configuration on the JVM side.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [AURON #1780] Fix the ORC table written by Hive to read null uppercase fields [auron]

Reply via email to