The approach sounds okay to me. It's usually preferable to repartition the data by your partition dimensions to keep the number of data files that each writer needs to create to a minimum.
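For context, a minimal sketch of that repartitioning (not from this thread), using foreachBatch with the batch Iceberg writer; the Kafka source, the "event_date" column, the "db.events" table name, and the checkpoint path are all hypothetical:

    // Repartition each micro-batch by the table's partition column before
    // writing, so each task only creates files for a few partitions.
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    object RepartitionBeforeWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("iceberg-streaming").getOrCreate()

        val events: DataFrame = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
          .withColumn("event_date", col("timestamp").cast("date"))

        events.writeStream
          .foreachBatch { (batch: DataFrame, _: Long) =>
            // Clustering by the partition column keeps the number of open
            // appenders per writer task to a minimum.
            batch
              .repartition(col("event_date"))
              .write
              .format("iceberg")
              .mode("append")
              .save("db.events")
          }
          .option("checkpointLocation", "/tmp/checkpoints/events")
          .start()
          .awaitTermination()
      }
    }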
Also, if buffering in memory starts taking too much memory, you can switch from Parquet to Avro for the data written by your streaming job. Avro doesn't buffer nearly as much in memory, so it has a much lighter memory footprint and you'd be able to keep more files open. That's one reason why we included it in the spec. But repartitioning your data is the better solution if you can do that.

On Mon, Oct 7, 2019 at 8:14 AM Dave Sugden <dave.sug...@shopify.com.invalid> wrote:
> Hi,
>
> We are using the Iceberg Spark datasource with Spark Structured Streaming.
> The issue with this is, of course, that the incoming partitions are not
> sorted.
> We have implemented our own streaming partition writer (extending
> DataSourceWriter, StreamWriter).
>
> We started by keeping the FileAppenders open and writing InternalRows to
> them as they came in, but noticed we OOM'd with this approach, as each
> FileAppender seems to allocate a large chunk of memory.
>
> Our new approach is:
> 1) Buffer all incoming InternalRow objects (using row.copy()), bucketed
> by PartitionKey.
> 2) On commit, create a FileAppender for each partition, one at a time,
> writing all rows for that appender.
>
> This approach seems to work, but we'd like to know if this is crazy, or if
> here be dragons :).
>
> Is there a superior way to do this?
>
> It would appear InternalRow.copy() creates an actual copy.
>
> Thanks,
>
> dave

--
Ryan Blue
Software Engineer
Netflix
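For reference, a minimal sketch of the Avro switch suggested above, assuming a HadoopTables-style table; the table location is hypothetical, and the write.format.default property (TableProperties.DEFAULT_FILE_FORMAT) is one way to change the default file format for new writes:

    import org.apache.hadoop.conf.Configuration
    import org.apache.iceberg.TableProperties
    import org.apache.iceberg.hadoop.HadoopTables

    object SwitchToAvro {
      def main(args: Array[String]): Unit = {
        // Load the table by its location (path is hypothetical).
        val table = new HadoopTables(new Configuration())
          .load("hdfs://namenode/warehouse/db/events")

        // write.format.default controls the format used for new data files;
        // Avro writes rows as they come instead of buffering Parquet row
        // groups in memory.
        table.updateProperties()
          .set(TableProperties.DEFAULT_FILE_FORMAT, "avro")
          .commit()
      }
    }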