wombatu-kun opened a new pull request, #16791: URL: https://github.com/apache/iceberg/pull/16791
`FlinkParquetWriters`' `ArrayDataWriter` and `MapDataWriter` allocate per row when writing array and map columns. For every value, `elements(...)` / `pairs(...)` create a fresh iterator, whose constructor in turn calls `ArrayData.createElementGetter(...)` for the element (and for both key and value in the map case). For nullable element/value types, `createElementGetter` returns a capturing null-checking wrapper, so it allocates too; the map iterator additionally allocates a `ReusableEntry` per row.

The element/key/value getters depend only on the column types, which are fixed at writer construction, so they are now built once in the writer. The iterators themselves are reused: the parent `RepeatedWriter` / `RepeatedKeyValueWriter` fully consumes the iterator inside a single `write()` call and never retains it (it drains it in a `while` loop), and writers are single-threaded, so one reusable iterator instance per writer (reset on each call, with the map's `ReusableEntry` allocated once) is safe. Nested collections use distinct writer instances, so no iterator is ever re-entered.

The change is identical across the supported Flink versions, so it is applied to v1.20, v2.0, and v2.1 in this PR.

### Benchmark

JMH (JDK 17, `-prof gc`, SingleShotTime), end-to-end Flink Parquet write of 1,000,000 rows of `id: long, tags: array<int>, props: map<string, long>`, measured for non-nullable and nullable (optional) element/value types.

| Collection elements | Before | After | Delta |
| --- | --- | --- | --- |
| non-nullable | 508.3 MB/op | 436.3 MB/op | -14.2% |
| nullable | 421.7 MB/op | 349.7 MB/op | -17.1% |

(Allocation per 1,000,000-row write.)

Wall-clock time was unchanged within noise; this is an allocation / GC-pressure reduction on collection-heavy writes.

Existing `TestFlinkParquetWriter` / `TestFlinkParquetReader` round-trip coverage (arrays, maps, nested structs, required and optional, dictionary and fallback encodings) passes unchanged.
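For readers who want to see the shape of the reuse pattern described at the top of this description, here is a minimal, dependency-free sketch. It is not the code in this PR: `ElementGetter`, `ReusableArrayIterator`, and `SketchArrayWriter` are hypothetical names, a plain `Object[]` stands in for Flink's `ArrayData`, and a `List` stands in for the Parquet column writer. It only illustrates building the getter once at construction, resetting a single iterator per row, and draining it inside `write()`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a type-specific element accessor. */
interface ElementGetter<E> {
  E getElementOrNull(Object[] array, int pos);
}

/** Iterator allocated once and re-pointed at a new array for every row. */
final class ReusableArrayIterator<E> implements Iterator<E> {
  private final ElementGetter<E> getter; // depends only on the column type
  private Object[] array;
  private int size;
  private int index;

  ReusableArrayIterator(ElementGetter<E> getter) {
    this.getter = getter;
  }

  /** Points the iterator at the start of a new array and returns this instance. */
  ReusableArrayIterator<E> reset(Object[] newArray) {
    this.array = newArray;
    this.size = newArray.length;
    this.index = 0;
    return this;
  }

  @Override
  public boolean hasNext() {
    return index < size;
  }

  @Override
  public E next() {
    if (index >= size) {
      throw new NoSuchElementException();
    }
    E element = getter.getElementOrNull(array, index);
    index += 1;
    return element;
  }
}

/** Hypothetical writer: builds the getter and iterator once, reuses them per row. */
final class SketchArrayWriter {
  private final ReusableArrayIterator<Integer> elements =
      new ReusableArrayIterator<Integer>((array, pos) -> (Integer) array[pos]);

  /** Stand-in for the downstream column writer. */
  final List<Integer> sink = new ArrayList<>();

  /** Fully drains the iterator before returning, so it is never retained. */
  void write(Object[] row) {
    Iterator<Integer> it = elements.reset(row);
    while (it.hasNext()) {
      sink.add(it.next());
    }
  }

  /** Exposed only so the sketch can show that the same instance is handed out. */
  Iterator<Integer> iteratorFor(Object[] row) {
    return elements.reset(row);
  }
}
```

The pattern is only safe under the conditions stated above: the consumer finishes with the iterator before the next `write()` call, a writer is used by one thread, and nested collections get their own writer (and therefore their own iterator) instances.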
