Thanks guys!
I am using SSS to backfill the past 3 months of data. I thought I could use SSS
for both the historical data and the new data. I just realized that SSS is not
appropriate for backfilling, since the watermark relies on receivedAt, which
could be 3 months old. I will use a batch job for the backfill and use SSS for new data.
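A batch backfill along those lines could look like the following minimal sketch; the input/output paths and the partition column are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("backfill").getOrCreate()

// One-off batch job: read the historical data, deduplicate it on the same
// keys the streaming query uses, and write it to the same partitioned layout.
// All paths and the partition column are hypothetical placeholders.
spark.read
  .parquet("s3://bucket/raw/")          // historical input (assumed path)
  .dropDuplicates("Id", "receivedAt")   // same dedup keys as the streaming query
  .write
  .mode("append")
  .partitionBy("availabilityDomain")    // assumed partition column
  .parquet("s3://bucket/deduped/")      // assumed output path
```

Because a batch job sees the whole input at once, no watermark or streaming state is involved, which is why it suits the backfill better.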
The query makes the state grow infinitely. Could you consider applying a
watermark to "receivedAt" to let the unnecessary part of the state be cleared
out? Other than a watermark, you could implement TTL-based eviction via
flatMapGroupsWithState, though you'll need to implement your own custom
"dropDuplicates".
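In code, the watermark suggestion could look like this minimal sketch; the 1-hour delay is an assumed value that should be tuned to how late your data can arrive:

```scala
// With a watermark on the event-time column, dropDuplicates can discard
// state older than the watermark instead of keeping it forever.
// "1 hour" is an assumed allowed lateness, not a recommendation.
val deduped = df
  .withWatermark("receivedAt", "1 hour")
  .dropDuplicates("Id", "receivedAt")
```

Note that receivedAt must be a timestamp column for the watermark to apply, and duplicates arriving later than the watermark delay will no longer be caught.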
March 1, 2019
Use https://github.com/chermenin/spark-states instead
On Sun, Mar 10, 2019 at 8:51 PM Arun Mahadevan wrote:
Read the link carefully,
This solution is available (*only*) in Databricks Runtime.
You can enable RocksDB-based state management by setting the following
configuration in the SparkSession before starting the streaming query.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "co
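As a hedged example, wiring in the open-source RocksDB provider mentioned above (chermenin/spark-states) would look roughly like this; the exact provider class name is an assumption taken from that library's documentation and should be verified against the version you depend on:

```scala
// Point Structured Streaming at a custom state store implementation.
// The class name below is assumed from the chermenin/spark-states project;
// check it against the library version you actually use.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
```

This keeps large deduplication state on disk via RocksDB rather than in executor memory, which is the main appeal over the default HDFS-backed in-memory provider.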
Hi,
I have a very simple SSS pipeline which does:

val query = df
  .dropDuplicates(Array("Id", "receivedAt"))
  .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
  .writeStream
  .format("parquet")
  .partitionBy("availabilityDomain", timePartitionCol)
  .trigger(Trigger.Processin
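For reference, a self-contained sketch of how the truncated query above might continue, with the watermark suggestion applied; the watermark delay, trigger interval, checkpoint location, and output path are all assumptions:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

// Hedged reconstruction of the pipeline, with an event-time watermark added
// so dropDuplicates state can be evicted. Intervals and paths are placeholders.
val query = df
  .withWatermark("receivedAt", "1 hour")              // assumed allowed lateness
  .dropDuplicates("Id", "receivedAt")
  .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
  .writeStream
  .format("parquet")
  .partitionBy("availabilityDomain", timePartitionCol)
  .trigger(Trigger.ProcessingTime("1 minute"))        // assumed interval
  .option("checkpointLocation", "/tmp/checkpoints")   // assumed path
  .option("path", "/tmp/output")                      // assumed path
  .start()

query.awaitTermination()
```

The key change versus the original is withWatermark before dropDuplicates; without it, the deduplication state is never purged, which matches the unbounded growth described in the thread.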