qlong opened a new pull request, #54343: URL: https://github.com/apache/spark/pull/54343
### Why are the changes needed?
Variant shredding schema inference is expensive and can take over 100ms per file. This change replaces fold-based schema merging with deferred schema construction driven by a single-pass collection of field statistics.

Previous approach:
- Used `foldLeft` to build and merge a complete schema for each row
- Merged schemas repeatedly across 4096 rows
- High allocation overhead from recursive schema construction

New approach:
- Separates schema construction from field statistics collection, avoiding excessive intermediate allocations and repeated merges
- Traverses fields in a single pass, tracking field types and row counts in a flat statistics registry
- Uses `lastSeenRow` for deduplication
- Defers schema construction until after all rows are processed (see the sketch at the end of this message)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Functional testing:
* All existing unit tests pass

Performance vs. master:
- Tested scenarios with different field counts, array sizes, and batch sizes (1-4096 rows, 10-200 fields, varying nesting depths and sparsity patterns)
- Average 1.5x speedup across test scenarios
- 1.5x-1.6x faster on array-heavy workloads
- 11.5x faster on sparse data (10% field presence)
- Consistent performance across multiple runs
- 96% of tests show improvement

### Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude Sonnet 4.5
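To make the deferred-construction idea concrete, here is a minimal sketch of a flat statistics registry with `lastSeenRow` deduplication. It is not the actual Spark implementation: the class name `FieldStatsRegistry`, the string-keyed field paths, and the widen-to-string type merge are all simplifying assumptions for illustration.

```scala
import scala.collection.mutable
import org.apache.spark.sql.types._

// Hypothetical sketch, simplified relative to the real shredding code.
// Instead of building and merging a full schema per row, we accumulate
// per-field statistics in a flat registry keyed by field path, then
// construct the schema once after the whole batch has been scanned.
class FieldStatsRegistry {
  // One entry per distinct field path: the merged type, how many rows
  // contained the field, and the last row index that touched it.
  private case class FieldStats(var dataType: DataType, var rowCount: Int,
                                var lastSeenRow: Int)
  private val stats = mutable.LinkedHashMap.empty[String, FieldStats]

  // Record one field occurrence. `lastSeenRow` deduplicates repeated
  // occurrences of the same path within a single row (e.g. array elements),
  // so rowCount counts rows, not occurrences.
  def observe(path: String, dt: DataType, rowIndex: Int): Unit = {
    stats.get(path) match {
      case Some(s) =>
        s.dataType = mergeTypes(s.dataType, dt)
        if (s.lastSeenRow != rowIndex) { // count each row at most once
          s.rowCount += 1
          s.lastSeenRow = rowIndex
        }
      case None =>
        stats(path) = FieldStats(dt, 1, rowIndex)
    }
  }

  // Placeholder type merge: widen to StringType on conflict. The real
  // merge rules are richer; this just keeps the sketch self-contained.
  private def mergeTypes(a: DataType, b: DataType): DataType =
    if (a == b) a else StringType

  // Deferred construction: the schema is built exactly once, after all
  // rows have been processed, so no intermediate schemas are allocated.
  def buildSchema(): StructType =
    StructType(stats.map { case (path, s) =>
      StructField(path, s.dataType, nullable = true)
    }.toSeq)
}

object FieldStatsRegistryExample extends App {
  val reg = new FieldStatsRegistry
  // Row 0 has fields a and b; row 1 has only a (sparse data).
  reg.observe("a", IntegerType, 0)
  reg.observe("b", LongType, 0)
  reg.observe("a", IntegerType, 1)
  println(reg.buildSchema()) // built once, after the single pass
}
```

On sparse data this shape of traversal only pays for fields that actually appear in each row, which is consistent with the larger speedup the PR reports for the 10% field-presence scenario.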
