qlong opened a new pull request, #54343: URL: https://github.com/apache/spark/pull/54343
### Why are the changes needed?
Variant shredding schema inference is expensive and can take over 100ms per file. This change replaces fold-based schema merging with deferred schema construction driven by a single-pass collection of field statistics.

Previous approach:
- Used `foldLeft` to build and merge a complete schema for each row
- Merged schemas repeatedly across 4096 rows
- High allocation overhead from recursive schema construction

New approach:
- Separates schema construction from field statistics collection, avoiding excessive intermediate allocations and repeated merges
- Traverses fields in a single pass, tracking field types and row counts in a flat statistics registry
- Uses `lastSeenRow` for deduplication
- Defers schema construction until after all rows are processed (see the sketch at the end of this message)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Functional testing:
* All existing unit tests pass

Performance vs. master:
- Tested scenarios with different field counts, array sizes, and batch sizes (1-4096 rows, 10-200 fields, varying nesting depths and sparsity patterns)
- Average 1.5x speedup across test scenarios
- 1.5x-1.6x faster on array-heavy workloads
- 11.5x faster on sparse data (10% field presence)
- Consistent performance across multiple runs
- 96% of tests show improvement

### Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude Sonnet 4.5
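To make the deferred-construction idea concrete, here is a minimal sketch of a flat statistics registry with `lastSeenRow` deduplication. It is not the actual Spark implementation: the class name `FieldStatsRegistry`, the string-keyed field paths, and the widen-to-string type merge are all simplifying assumptions for illustration.

```scala
import scala.collection.mutable
import org.apache.spark.sql.types._

// Hypothetical sketch, simplified relative to the real shredding code.
// Instead of building and merging a full schema per row, we accumulate
// per-field statistics in a flat registry keyed by field path, then
// construct the schema once after the whole batch has been scanned.
class FieldStatsRegistry {
  // One entry per distinct field path: the merged type, how many rows
  // contained the field, and the last row index that touched it.
  private case class FieldStats(var dataType: DataType, var rowCount: Int,
                                var lastSeenRow: Int)
  private val stats = mutable.LinkedHashMap.empty[String, FieldStats]

  // Record one field occurrence. `lastSeenRow` deduplicates repeated
  // occurrences of the same path within a single row (e.g. array elements),
  // so rowCount counts rows, not occurrences.
  def observe(path: String, dt: DataType, rowIndex: Int): Unit = {
    stats.get(path) match {
      case Some(s) =>
        s.dataType = mergeTypes(s.dataType, dt)
        if (s.lastSeenRow != rowIndex) { // count each row at most once
          s.rowCount += 1
          s.lastSeenRow = rowIndex
        }
      case None =>
        stats(path) = FieldStats(dt, 1, rowIndex)
    }
  }

  // Placeholder type merge: widen to StringType on conflict. The real
  // merge rules are richer; this just keeps the sketch self-contained.
  private def mergeTypes(a: DataType, b: DataType): DataType =
    if (a == b) a else StringType

  // Deferred construction: the schema is built exactly once, after all
  // rows have been processed, so no intermediate schemas are allocated.
  def buildSchema(): StructType =
    StructType(stats.map { case (path, s) =>
      StructField(path, s.dataType, nullable = true)
    }.toSeq)
}

object FieldStatsRegistryExample extends App {
  val reg = new FieldStatsRegistry
  // Row 0 has fields a and b; row 1 has only a (sparse data).
  reg.observe("a", IntegerType, 0)
  reg.observe("b", LongType, 0)
  reg.observe("a", IntegerType, 1)
  println(reg.buildSchema()) // built once, after the single pass
}
```

On sparse data this shape of traversal only pays for fields that actually appear in each row, which is consistent with the larger speedup the PR reports for the 10% field-presence scenario.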
