Should I try the "fanout-enabled" option within the foreachBatch method,
where I do dataframe.write?
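
Something like this is what I had in mind (just a rough sketch; the table
name, checkpoint path, and source dataframe are placeholders):

# sketch, assuming an existing SparkSession and a streaming DataFrame
# `stream_df` obtained from spark.readStream; table name and checkpoint
# path below are made up
def write_batch(batch_df, batch_id):
    (batch_df.write
        .format("iceberg")
        .option("fanout-enabled", "true")   # write option on the per-batch writer
        .mode("append")
        .save("my_catalog.db.my_table"))

(stream_df.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/my_table")
    .start())

Or, if that option still has no effect, I guess I could sort each batch by
the partition columns inside write_batch before the write, the same way the
ORDER BY in the SQL version below does.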

On Wed, Aug 30, 2023 at 10:29 AM Nirav Patel <nira...@gmail.com> wrote:

> Hi,
>
> I am using Spark Structured Streaming with a foreachBatch sink to append
> to an Iceberg table with two hidden partition fields.
> I got this infamous error about the input dataframe or partitions needing
> to be clustered:
>
> *Incoming records violate the writer assumption that records are clustered
> by spec and by partition within each spec. Either cluster the incoming
> records or switch to fanout writers.*
>
> I tried setting "fanout-enabled" to "true" before calling foreachBatch but
> it didn't work at all. I got the same error.
>
> I tried partitionedBy(days("date"), col("customerid")) and that didn't
> work either.
>
> Then I used the Spark SQL approach:
> INSERT INTO {dest_schema_fqn}
> SELECT * FROM {success_agg_tbl} ORDER BY date(date), tenant
>
> and that worked.
>
> I know of the following table-level configs:
> write.spark.fanout.enabled - false
> write.distribution-mode - none
> but I have left them at their defaults, as I assume the writer will
> override those settings.
>
> So does the "fanout-enabled" option have any effect when used with
> foreachBatch?
> (I'm new to spark streaming as well)
>
> thanks
>
