yadavay-amzn opened a new pull request, #55928: URL: https://github.com/apache/spark/pull/55928
### What changes were proposed in this pull request?

`getFieldByKey()` uses binary search for objects with >=32 fields, assuming field IDs are sorted alphabetically by key name. The Variant format spec allows unsorted objects (indicated by bit 4 of the object header). External producers (Parquet, Iceberg) may produce unsorted variants, causing binary search to silently return null for keys that exist.

Fix: check the object header sort bit before choosing binary search vs linear scan. Fall back to linear scan when fields are unsorted.

### Why are the changes needed?

Data correctness bug: `getFieldByKey` silently returns null for fields that exist in unsorted variant objects. This affects any variant data produced by external systems that do not sort field IDs.

### Does this PR introduce _any_ user-facing change?

Yes. Queries on variant columns with unsorted objects will now correctly return field values instead of null.

### How was this patch tested?

Added test in `VariantExpressionSuite` that constructs a 32-field unsorted variant object (sort bit=0, field IDs in reverse order) and verifies `getFieldByKey` finds keys correctly. Test fails without the fix (binary search returns null), passes with it.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
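
For readers who want to see the failure mode concretely, here is a minimal, language-neutral sketch in Python. It is not the Spark implementation: a variant object is modelled as a plain list of `(key, value)` pairs rather than field IDs and offsets, and the names `get_field_by_key`, `binary_search`, `linear_scan`, and the `is_sorted` parameter (standing in for the sort bit described above) are all hypothetical. Only the 32-field threshold and the reverse-ordered test shape come from the description above.

```python
from typing import List, Optional, Tuple

# Threshold taken from the PR description: binary search for >= 32 fields.
BINARY_SEARCH_THRESHOLD = 32

# Simplified stand-in for a variant object: a list of (key, value) pairs.
Field = Tuple[str, int]


def binary_search(fields: List[Field], key: str) -> Optional[int]:
    """Standard binary search; only correct if fields are sorted by key."""
    lo, hi = 0, len(fields) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        k, v = fields[mid]
        if k == key:
            return v
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


def linear_scan(fields: List[Field], key: str) -> Optional[int]:
    """Order-independent lookup; correct for sorted and unsorted fields."""
    for k, v in fields:
        if k == key:
            return v
    return None


def get_field_by_key(fields: List[Field], key: str, is_sorted: bool) -> Optional[int]:
    """Use binary search only when the object is large AND flagged as sorted.

    `is_sorted` is a hypothetical parameter representing the sort bit.
    """
    if is_sorted and len(fields) >= BINARY_SEARCH_THRESHOLD:
        return binary_search(fields, key)
    return linear_scan(fields, key)


# A 32-field object with keys in reverse order, mirroring the test described above.
unsorted_fields = [(f"k{i:02d}", i) for i in reversed(range(32))]

# Binary search on unsorted input misses a key that exists.
missed = binary_search(unsorted_fields, "k00")

# Checking the flag first falls back to a linear scan and finds it.
found = get_field_by_key(unsorted_fields, "k00", is_sorted=False)
```

In this sketch, `missed` is `None` even though `"k00"` is present, while `found` holds the stored value. The linear fallback costs O(n) per lookup, which is the price of correctness when the producer gives no ordering guarantee.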
