wombatu-kun opened a new pull request, #16658: URL: https://github.com/apache/iceberg/pull/16658
`JsonToMapUtils` decided whether a JSON array had a uniform element type by building a `HashSet` and inspecting its size: `arrayNodeType` collected every element's class into a set, and the nested-array branch of `arraySchema` collected every element's inferred `Schema` into a set. Both allocate a set and visit the entire array even when the second element already proves the types are mixed; the nested case is worse because it calls the (recursive, allocating) `schemaFromNode` on every element before deciding.

This replaces both with a single pass that tracks the first element's type/schema and bails out on the first one that differs. No set is allocated, and for the nested case the early exit also skips `schemaFromNode` on the remaining elements. Behavior is unchanged (same result for empty, uniform, and mixed arrays).

**When this runs:** these are on the per-record path of the `JsonToMapTransform` SMT. When a connector is configured with that transform (its purpose is to land arbitrary, schema-less JSON into a typed Connect schema), Kafka Connect calls `apply()` for every record, which infers a schema from the record's JSON via `schemaFromNode`. `arrayNodeType` is invoked once per JSON array field per record, and the nested-array branch once per array-of-arrays field per record. So the cost is paid for every array-typed field of every record the transform processes; it is not incurred when the transform is not configured.

There are two independent sources of the gain.

(1) For a uniform array (every element the same type, the common case) the scan still runs to the end, so the win comes purely from not allocating a `HashSet` and not hashing each element: the single pass does one reference `equals` per element instead of a `hashCode` + bucket lookup + set insert.

(2) For a mixed array the single pass additionally bails out at the first differing element, skipping the rest of the scan (and, for nested arrays, the remaining `schemaFromNode` calls).
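For readers who want the shape of the change without opening the diff, here is a minimal sketch of the two strategies described above. It is not the code from this PR: the class `UniformTypeSketch` and both method names are hypothetical, a plain `List<?>` stands in for a Jackson array node, and returning `null` for both empty and mixed input is a simplification chosen for the sketch (the real methods' results for those cases are not reproduced here).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; names and return convention are assumptions.
final class UniformTypeSketch {

  // "Before" shape: collect every element's class into a set, then check its size.
  // Always visits the whole list and allocates a HashSet.
  static Class<?> uniformTypeWithSet(List<?> elements) {
    Set<Class<?>> types = new HashSet<>();
    for (Object element : elements) {
      types.add(element.getClass());
    }
    // null means "empty or mixed" in this sketch
    return types.size() == 1 ? types.iterator().next() : null;
  }

  // "After" shape: remember the first element's class and stop at the first mismatch.
  // No set is allocated; a mixed list exits early.
  static Class<?> uniformTypeSinglePass(List<?> elements) {
    Class<?> first = null;
    for (Object element : elements) {
      Class<?> current = element.getClass();
      if (first == null) {
        first = current;
      } else if (!first.equals(current)) {
        return null; // mixed: no need to look at the remaining elements
      }
    }
    return first; // null when the list is empty
  }

  public static void main(String[] args) {
    List<?> uniform = List.of(1, 2, 3);
    List<?> mixed = List.of(1, "two", 3);
    System.out.println(uniformTypeWithSet(uniform) == uniformTypeSinglePass(uniform));
    System.out.println(uniformTypeWithSet(mixed) == uniformTypeSinglePass(mixed));
  }
}
```

The nested-array case would follow the same pattern, with the inferred schema of each element taking the place of `getClass()`; there the early exit also avoids computing the schema for the elements that are never visited.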
A throwaway A/B microbench over the whole methods (1M iterations for `arrayNodeType`, 50k for the nested branch, x 9 trials, median; baseline calls the real production methods, including real `schemaFromNode`). In the tables, **uniform** = all elements share one type (full scan, no early exit) and **mixed@2** = the type first differs at element 2 (early exit possible):

`arrayNodeType` (set of element classes):

| array | before | after | faster |
|---|---|---|---|
| uniform, 10 | 110.3 ns | 12.0 ns | 89% |
| uniform, 100 | 902.3 ns | 68.8 ns | 92% |
| mixed@2, 10 | 119.8 ns | 6.7 ns | 94% |
| mixed@2, 100 | 1069.4 ns | 6.7 ns | 99% |

nested-array branch (set of element schemas):

| array | before | after | faster |
|---|---|---|---|
| uniform, 10 | 1576.3 ns | 1038.5 ns | 34% |
| uniform, 50 | 7594.4 ns | 5268.1 ns | 31% |
| mixed@2, 10 | 1414.9 ns | 229.5 ns | 84% |
| mixed@2, 50 | 6749.1 ns | 208.3 ns | 97% |

Note the uniform rows: even with no early exit, removing the set/hashing alone is ~90% (`arrayNodeType`) and ~30% (nested) faster; the mixed rows add the early-exit win on top. The numbers are wall-clock from a microbench, not JMH.

Existing `TestJsonToMapUtils` covers array, nested-array, and mixed-type cases.
