wombatu-kun opened a new pull request, #16658:
URL: https://github.com/apache/iceberg/pull/16658

   `JsonToMapUtils` decided whether a JSON array had a uniform element type by 
building a `HashSet` and inspecting its size: `arrayNodeType` collected every 
element's class into a set, and the nested-array branch of `arraySchema` 
collected every element's inferred `Schema` into a set. Both allocate a set and 
visit the entire array even when the second element already proves the types 
are mixed; the nested case is worse because it calls the (recursive, 
allocating) `schemaFromNode` on every element before deciding.
   
   This replaces both with a single pass that tracks the first element's 
type/schema and bails out on the first one that differs. No set is allocated, 
and for the nested case the early exit also skips `schemaFromNode` on the 
remaining elements. Behavior is unchanged (same result for empty, uniform, and 
mixed arrays).
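
To make the change concrete, here is a minimal Java sketch of the two strategies over a generic list. It is not the code in `JsonToMapUtils`: the class `UniformTypeSketch`, both method names, and the choice to return `null` for empty or mixed input are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; not the Iceberg implementation.
final class UniformTypeSketch {
  private UniformTypeSketch() {}

  // Set-based approach: derives a key for every element, then inspects the set size.
  // Returns the single shared key, or null when the list is empty or mixed (assumed contract).
  static <T, K> K uniformKeyViaSet(List<T> elements, Function<T, K> keyOf) {
    Set<K> keys = new HashSet<>();
    for (T element : elements) {
      keys.add(keyOf.apply(element));
    }
    return keys.size() == 1 ? keys.iterator().next() : null;
  }

  // Single pass: remembers the first key and returns at the first key that differs.
  // No set is allocated, and keyOf is not applied to elements after the mismatch.
  static <T, K> K uniformKeySinglePass(List<T> elements, Function<T, K> keyOf) {
    K first = null;
    boolean haveFirst = false;
    for (T element : elements) {
      K key = keyOf.apply(element);
      if (!haveFirst) {
        first = key;
        haveFirst = true;
      } else if (!Objects.equals(first, key)) {
        return null; // mixed: bail out early
      }
    }
    return first; // null when the list is empty
  }
}
```

Under these assumptions both methods return the same value for empty, uniform, and mixed input; only the amount of work differs.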
   
   **When this runs:** both methods are on the per-record path of the 
`JsonToMapTransform` SMT. When a connector is configured with that transform 
(its purpose is to land arbitrary, schema-less JSON into a typed Connect 
schema), Kafka Connect calls `apply()` for every record, which infers a schema 
from the record's JSON via `schemaFromNode`. `arrayNodeType` is invoked once 
per JSON array field per record, and the nested-array branch once per 
array-of-arrays field per record. So the cost is paid for every array-typed 
field of every record the transform processes; it is not incurred when the 
transform is not configured.
   
   There are two independent sources of the gain. (1) For a uniform array 
(every element the same type, the common case) the scan still runs to the end, 
so the win comes purely from not allocating a `HashSet` and not hashing each 
element: the single pass does one `equals` comparison per element (a reference comparison for `Class` objects) instead of a 
`hashCode` + bucket lookup + set insert. (2) For a mixed array the single pass 
additionally bails out at the first differing element, skipping the rest of the 
scan (and, for nested arrays, the remaining `schemaFromNode` calls).
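
The second source of gain can be illustrated by counting how often the per-element inference function runs under each strategy. The sketch below is hypothetical (`EarlyExitSketch` and its methods are not part of Iceberg), it uses a plain `Function` as a stand-in for `schemaFromNode`, and it assumes the mismatch sits at the second element.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; "infer" stands in for an expensive call such as schemaFromNode.
final class EarlyExitSketch {
  private EarlyExitSketch() {}

  // Set-based strategy: infer runs once for every element, mixed or not.
  static <T, K> int inferCallsWithSet(List<T> elements, Function<T, K> infer) {
    int calls = 0;
    Set<K> seen = new HashSet<>();
    for (T element : elements) {
      calls++;
      seen.add(infer.apply(element));
    }
    return calls;
  }

  // Single-pass strategy: infer stops running once the first mismatch is found.
  static <T, K> int inferCallsWithSinglePass(List<T> elements, Function<T, K> infer) {
    int calls = 0;
    K first = null;
    boolean haveFirst = false;
    for (T element : elements) {
      calls++;
      K key = infer.apply(element);
      if (!haveFirst) {
        first = key;
        haveFirst = true;
      } else if (!Objects.equals(first, key)) {
        break; // remaining elements are never inspected
      }
    }
    return calls;
  }
}
```

For a 10-element list whose second element differs, the set-based count is 10 and the single-pass count is 2; for a uniform list both counts equal the list length, which is why the uniform rows show only the allocation and hashing savings.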
   
   Results from a throwaway A/B microbenchmark over the whole methods (1M 
iterations for `arrayNodeType`, 50k for the nested branch, 9 trials each, 
median reported; the baseline calls the real production methods, including the 
real `schemaFromNode`). In the tables, **uniform** = all elements share one 
type (full scan, no early exit) and **mixed@2** = the type first differs at 
element 2 (early exit possible):
   
   `arrayNodeType` (set of element classes):
   
   | array (kind, elements) | before | after | faster |
   |---|---|---|---|
   | uniform, 10 | 110.3 ns | 12.0 ns | 89% |
   | uniform, 100 | 902.3 ns | 68.8 ns | 92% |
   | mixed@2, 10 | 119.8 ns | 6.7 ns | 94% |
   | mixed@2, 100 | 1069.4 ns | 6.7 ns | 99% |
   
   nested-array branch (set of element schemas):
   
   | array (kind, elements) | before | after | faster |
   |---|---|---|---|
   | uniform, 10 | 1576.3 ns | 1038.5 ns | 34% |
   | uniform, 50 | 7594.4 ns | 5268.1 ns | 31% |
   | mixed@2, 10 | 1414.9 ns | 229.5 ns | 84% |
   | mixed@2, 50 | 6749.1 ns | 208.3 ns | 97% |
   
   Note the uniform rows: even with no early exit, removing the set and its 
hashing alone makes `arrayNodeType` ~90% faster and the nested branch ~30% 
faster; the mixed rows add the early-exit win on top. The numbers are 
wall-clock times from a throwaway microbenchmark, not JMH.
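
For readers who want to reproduce measurements of this kind, the following is a rough sketch of a median-of-trials timing loop, plus the arithmetic that appears consistent with the "faster" column (relative reduction in time per call). It is not the harness used for the numbers above: `MedianBenchSketch`, its method names, and the `sink` field are assumptions, and like any hand-rolled loop it is exposed to JIT warm-up and dead-code-elimination effects that JMH is designed to control.

```java
import java.util.Arrays;
import java.util.function.IntSupplier;

// Hypothetical timing sketch; not the benchmark used in this PR.
final class MedianBenchSketch {
  private MedianBenchSketch() {}

  // Written after each trial so the JIT cannot discard the measured work entirely.
  static volatile int sink;

  // Median of a copy of the input; the caller's array is left untouched.
  static double median(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int mid = sorted.length / 2;
    return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
  }

  // Runs `trials` timed loops of `iterations` calls each and returns the median ns per call.
  static double medianNanosPerCall(IntSupplier work, int iterations, int trials) {
    double[] perCall = new double[trials];
    for (int t = 0; t < trials; t++) {
      int acc = 0;
      long start = System.nanoTime();
      for (int i = 0; i < iterations; i++) {
        acc += work.getAsInt();
      }
      long elapsed = System.nanoTime() - start;
      sink = acc;
      perCall[t] = (double) elapsed / iterations;
    }
    return median(perCall);
  }

  // Relative time reduction in percent: 100 * (before - after) / before.
  static double percentFaster(double beforeNanos, double afterNanos) {
    return 100.0 * (beforeNanos - afterNanos) / beforeNanos;
  }
}
```

As a check on the arithmetic, `percentFaster(110.3, 12.0)` is about 89.1 and `percentFaster(7594.4, 5268.1)` is about 30.6, matching the 89% and 31% entries after rounding.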
   
   Existing `TestJsonToMapUtils` covers array, nested-array, and mixed-type 
cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
