Hi,

We are using the Iceberg Spark datasource with Spark Structured Streaming. The issue is that the incoming data is not sorted by partition. We have implemented our own streaming partition writer (implementing DataSourceWriter and StreamWriter).
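For reference, the writer plugs into the DataSourceV2 streaming API roughly like this. Class names are ours and the bodies are trimmed; the lambda below matches the Spark 2.4 createDataWriter(partitionId, taskId, epochId) signature, which differs slightly in older 2.x releases:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;
    import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter;

    // Skeleton of our streaming writer (illustrative names, bodies trimmed).
    public class IcebergStreamWriter implements StreamWriter {

      @Override
      public DataWriterFactory<InternalRow> createWriterFactory() {
        // One buffering DataWriter per task; sketched at the end of this mail.
        return (partitionId, taskId, epochId) -> new BufferingPartitionDataWriter();
      }

      @Override
      public void commit(long epochId, WriterCommitMessage[] messages) {
        // Gather the data files reported by each task and append them to the
        // Iceberg table in one transaction for this epoch.
      }

      @Override
      public void abort(long epochId, WriterCommitMessage[] messages) {
        // Clean up any files written by aborted tasks.
      }
    }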
We started by keeping the FileAppenders open and writing InternalRows to them as they came in, but we hit OOMs with that approach, since each FileAppender seems to allocate a large chunk of memory. Our new approach is:

1) Buffer all incoming InternalRow objects (using row.copy()), bucketed by PartitionKey.
2) On commit, create a FileAppender for each partition, one at a time, and write all of the buffered rows for that partition before moving on to the next.

This approach seems to work, but we'd like to know if this is crazy, or if here be dragons :). Is there a superior way to do this? It does appear that InternalRow.copy() creates an actual copy. A rough sketch of the buffering writer is at the end of this mail.

Thanks,
dave
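P.S. For concreteness, here is a trimmed-down sketch of the buffering writer described above. The helpers partitionKeyFor(...) and newAppenderFor(...) are placeholders for our own code that derives a (copied) Iceberg PartitionKey from an InternalRow and builds the file appender for a partition; commit-message plumbing and error handling are omitted.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.iceberg.PartitionKey;
    import org.apache.iceberg.io.FileAppender;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.writer.DataWriter;
    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

    public class BufferingPartitionDataWriter implements DataWriter<InternalRow> {

      // Rows buffered per partition until commit. row.copy() is needed because
      // Spark reuses the same InternalRow instance across write() calls.
      private final Map<PartitionKey, List<InternalRow>> buffers = new HashMap<>();

      @Override
      public void write(InternalRow row) {
        buffers.computeIfAbsent(partitionKeyFor(row), k -> new ArrayList<>())
               .add(row.copy());
      }

      @Override
      public WriterCommitMessage commit() throws IOException {
        // Only one FileAppender is open at any time, to keep memory bounded.
        for (Map.Entry<PartitionKey, List<InternalRow>> entry : buffers.entrySet()) {
          try (FileAppender<InternalRow> appender = newAppenderFor(entry.getKey())) {
            for (InternalRow row : entry.getValue()) {
              appender.add(row);
            }
          }
        }
        buffers.clear();
        return new BufferedFilesCommitMessage();  // carries the resulting data files in our real code
      }

      @Override
      public void abort() {
        buffers.clear();
      }

      // Placeholders: in our writer these evaluate the table's PartitionSpec to get a
      // copied PartitionKey for the row, and build the appender for that partition's file.
      private PartitionKey partitionKeyFor(InternalRow row) {
        throw new UnsupportedOperationException("sketch only");
      }

      private FileAppender<InternalRow> newAppenderFor(PartitionKey key) {
        throw new UnsupportedOperationException("sketch only");
      }

      private static class BufferedFilesCommitMessage implements WriterCommitMessage {
      }
    }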