stevenzwu commented on issue #1643: URL: https://github.com/apache/iceberg/issues/1643#issuecomment-714811863
Regarding copying RowData from iterator to a fixed-size array by the `IcebergSourceSplitReader<T>`, let's first discuss some of the implementation details. - How would `IcebergSourceSplitReader<T>` know how to create a reuseable empty object array for a generic type T? - How would `IcebergSourceSplitReader<T>` know how to clone the generic type T returned by DataIterator.next()? Both Flink RowData and Avro GenericRecord don't implement `Cloneable` interface. RowDataIterator and its underneath Flink*Reader have the knowledge on how to achieve the correct object reuse semantic. Assuming `IcebergSourceSplitReader<T>` can implement proper cloning to reusable array, we should still discuss the reuse semantic for records emitted by Iceberg source. Not sure if Flink defines the reuse semantic for the records emitted by source or it is up to the source implementation. If it is up to the source and we decided that Iceberg source should only emit reused records, we should clearly document the expectation. E.g. users could have a chained downstream operator (no serialization) that performs de-dup aggregation in memory. When it first time saw a key, it keep the record/row for the key. If there is another row coming for the same key, it simply update a field in the first received row. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
