Hi,
I am using Spark Structured Streaming with a foreachBatch sink to append
to an Iceberg table that has two hidden partition columns.
I got the infamous error about the incoming DataFrame needing to be
clustered by partition:
*Incoming records violate the writer assumption that records are clustered
by spec and by partition within each spec. Either cluster the incoming
records or switch to fanout writers.*
I tried setting the "fanout-enabled" option to "true" on the stream writer,
before calling foreachBatch, but it didn't work at all; I got the same error.
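Concretely, this is roughly what I had (table names and paths are
placeholders):

def write_batch(batch_df, batch_id):
    # the actual write happens here, inside foreachBatch
    batch_df.writeTo("catalog.db.events").append()

(df.writeStream
    .option("fanout-enabled", "true")  # set here, before foreachBatch
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start())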
I also tried partitionedBy(days("date"), col("customerid")), and that
didn't work either.
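By that I mean something like this (again, placeholder names); as far as I
can tell, partitionedBy only declares the table's partition spec at
creation time and doesn't cluster the rows being written:

from pyspark.sql.functions import col, days

(df.writeTo("catalog.db.events")
    .partitionedBy(days("date"), col("customerid"))
    .createOrReplace())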
Then I tried the Spark SQL approach:

INSERT INTO {dest_schema_fqn}
SELECT * FROM {success_agg_tbl} ORDER BY date(date), tenant

and that worked.
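If I understand why that worked, I assume the foreachBatch equivalent would
be to sort each micro-batch before appending, something like this (again
with placeholder names):

from pyspark.sql import functions as F

def write_batch(batch_df, batch_id):
    # Sorting within each task should be enough: as I read the error,
    # the writer only needs records clustered by partition within each
    # task, so this avoids the full shuffle of a global ORDER BY.
    (batch_df
        .sortWithinPartitions(F.to_date("date"), "tenant")
        .writeTo("catalog.db.events")
        .append())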
I know of the following table-level configs:
write.spark.fanout.enabled (default: false)
write.distribution-mode (default: none)
but I have left them at their defaults, since I assumed the write option
would override those settings.
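For completeness, I know I could set them at the table level like this
(placeholder table name), but I'd rather understand whether the write
option is supposed to take effect:

spark.sql("""
    ALTER TABLE catalog.db.events SET TBLPROPERTIES (
        'write.spark.fanout.enabled' = 'true',
        'write.distribution-mode'    = 'hash'
    )
""")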
so do "fanout-enabled" option have effect when using with foreachBatch?
(I'm new to spark streaming as well)
thanks