Hi,

We are using the Iceberg Spark datasource with Spark Structured Streaming. The issue is that the incoming data is not sorted by partition. We have implemented our own streaming partition writer (implementing DataSourceWriter and StreamWriter).
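For reference, the writer plugs into the DataSourceV2 streaming API roughly like this. Class names are ours and the bodies are trimmed; the lambda below matches the Spark 2.4 createDataWriter(partitionId, taskId, epochId) signature, which differs slightly in older 2.x releases:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;
    import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter;

    // Skeleton of our streaming writer (illustrative names, bodies trimmed).
    public class IcebergStreamWriter implements StreamWriter {

      @Override
      public DataWriterFactory<InternalRow> createWriterFactory() {
        // One buffering DataWriter per task; sketched at the end of this mail.
        return (partitionId, taskId, epochId) -> new BufferingPartitionDataWriter();
      }

      @Override
      public void commit(long epochId, WriterCommitMessage[] messages) {
        // Gather the data files reported by each task and append them to the
        // Iceberg table in one transaction for this epoch.
      }

      @Override
      public void abort(long epochId, WriterCommitMessage[] messages) {
        // Clean up any files written by aborted tasks.
      }
    }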
We started by keeping the FileAppenders open and writing InternalRows to them as they came in, but we hit OOMs with that approach, since each FileAppender seems to allocate a large chunk of memory. Our new approach is:

1) Buffer all incoming InternalRow objects (using row.copy()), bucketed by PartitionKey.
2) On commit, create a FileAppender for each partition, one at a time, and write all of the buffered rows for that partition before moving on to the next.

This approach seems to work, but we'd like to know if this is crazy, or if here be dragons :). Is there a superior way to do this? It does appear that InternalRow.copy() creates an actual copy. A rough sketch of the buffering writer is at the end of this mail.

Thanks,
dave
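P.S. For concreteness, here is a trimmed-down sketch of the buffering writer described above. The helpers partitionKeyFor(...) and newAppenderFor(...) are placeholders for our own code that derives a (copied) Iceberg PartitionKey from an InternalRow and builds the file appender for a partition; commit-message plumbing and error handling are omitted.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.iceberg.PartitionKey;
    import org.apache.iceberg.io.FileAppender;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.writer.DataWriter;
    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

    public class BufferingPartitionDataWriter implements DataWriter<InternalRow> {

      // Rows buffered per partition until commit. row.copy() is needed because
      // Spark reuses the same InternalRow instance across write() calls.
      private final Map<PartitionKey, List<InternalRow>> buffers = new HashMap<>();

      @Override
      public void write(InternalRow row) {
        buffers.computeIfAbsent(partitionKeyFor(row), k -> new ArrayList<>())
               .add(row.copy());
      }

      @Override
      public WriterCommitMessage commit() throws IOException {
        // Only one FileAppender is open at any time, to keep memory bounded.
        for (Map.Entry<PartitionKey, List<InternalRow>> entry : buffers.entrySet()) {
          try (FileAppender<InternalRow> appender = newAppenderFor(entry.getKey())) {
            for (InternalRow row : entry.getValue()) {
              appender.add(row);
            }
          }
        }
        buffers.clear();
        return new BufferedFilesCommitMessage();  // carries the resulting data files in our real code
      }

      @Override
      public void abort() {
        buffers.clear();
      }

      // Placeholders: in our writer these evaluate the table's PartitionSpec to get a
      // copied PartitionKey for the row, and build the appender for that partition's file.
      private PartitionKey partitionKeyFor(InternalRow row) {
        throw new UnsupportedOperationException("sketch only");
      }

      private FileAppender<InternalRow> newAppenderFor(PartitionKey key) {
        throw new UnsupportedOperationException("sketch only");
      }

      private static class BufferedFilesCommitMessage implements WriterCommitMessage {
      }
    }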