The approach sounds okay to me. It's usually preferable to repartition the
data by your partition dimensions to keep the number of data files that
each writer needs to create to a minimum.
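
For example, if your table is partitioned by a date column, something like
the sketch below should cluster rows so that each task only writes data
files for the partitions it actually receives. The column name "event_date"
is an assumption for illustration; use your table's real partition columns.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Hedged sketch: repartition the streaming DataFrame by the partition
  // column so all rows for a given date land in the same task.
  def clusterByPartition(df: DataFrame): DataFrame =
    df.repartition(col("event_date"))

Then hand the repartitioned frame to your writer as before; each task should
only need one open appender per partition it sees.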

Also, if buffering rows starts to use too much memory, you can switch to
Avro instead of Parquet for the data written by your streaming job. Avro
doesn't buffer nearly as much in memory, so it has a much lighter memory
footprint and you'd be able to keep more files open. That's one reason why
we included it in the spec. But repartitioning your data is the better
solution if you can do that.
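
If you go that route, here is a sketch of switching the table's default file
format to Avro via table properties (the HadoopTables load path below is
just an assumption about how you load the table):

  import org.apache.hadoop.conf.Configuration
  import org.apache.iceberg.hadoop.HadoopTables

  // Hedged sketch: set the default write format to Avro so the appenders
  // created by the streaming writer buffer far less than Parquet's column
  // writers. The table location is a placeholder.
  val table = new HadoopTables(new Configuration())
    .load("hdfs://namenode/warehouse/db/events")

  table.updateProperties()
    .set("write.format.default", "avro")  // TableProperties.DEFAULT_FILE_FORMAT
    .commit()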

On Mon, Oct 7, 2019 at 8:14 AM Dave Sugden <dave.sug...@shopify.com.invalid>
wrote:

> Hi,
>
> We are using the Iceberg Spark datasource with Spark Structured Streaming.
> The issue with this, of course, is that the incoming data is not sorted by
> partition.
> We have implemented our own streaming partition writer (extending
> DataSourceWriter, StreamWriter).
>
> We started by keeping the FileAppenders open and writing InternalRows to
> them as they came in, but we OOM'd with this approach because each
> FileAppender seems to allocate a large chunk of memory.
>
> Our new approach is:
> 1) Buffer all incoming InternalRow objects (using row.copy()), bucketed
> by PartitionKey
> 2) On Commit, create FileAppender for each partition, one at a time,
> writing all rows for that appender.
>
> This approach seems to work, but we'd like to know if this is crazy, or if
> here be dragons :).
>
> Is there a superior way to do this?
>
>
> It would appear InternalRow.copy() creates an actual copy.
>
> Thanks,
>
> dave
>
>

-- 
Ryan Blue
Software Engineer
Netflix
