Hi
Do we have any option to make streaming queries with multiple stateful
operations output data without waiting for this extra iteration? One of my
ideas was to force an empty micro-batch to run and propagate the
late-events watermark without any new data. While this conceptually works,
I didn't find a
It might be good to first split the stream up into smaller streams, one per
type. If ordering of the Kafka records is important, then you could
partition them at the source based on the type, but be careful how you
configure Spark to read from Kafka as that could also influence ordering.
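If you go the partitioning route, the property you need is that every record of one type lands on the same Kafka partition. A minimal sketch of such a deterministic partition choice in plain Python (the hash choice and the function name are assumptions for illustration, not a Kafka client API):

```python
import hashlib

def partition_for(type_tag: str, num_partitions: int) -> int:
    """Map a record's @type tag to a fixed partition so that all records
    of one type stay in order within a single Kafka partition."""
    digest = hashlib.md5(type_tag.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions
```

A producer would pass this value as the explicit partition (or simply use the tag as the message key and let the default partitioner achieve the same effect). Note that repartitioning inside Spark after the read can still reorder records, which is why the Kafka read configuration matters.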
Use an intermediate work table to land the streaming JSON data first, and
then, according to the tag, store the data in the correct table.
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom
Use foreachBatch or foreach methods:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
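Inside foreachBatch the routing can be a filter-and-write per tag. The dispatch itself is simple enough to sketch in plain Python (the table names and the catch-all bucket are hypothetical):

```python
import json
from collections import defaultdict

# Hypothetical @type -> target table mapping; real names depend on your payloads
TYPE_TO_TABLE = {"customer": "customers", "order": "orders"}

def route_batch(raw_payloads):
    """Group raw JSON strings by their @type tag, the way a foreachBatch
    body would filter a micro-batch once per tag before writing each
    group to its own table; unknown tags land in a catch-all bucket."""
    buckets = defaultdict(list)
    for raw in raw_payloads:
        doc = json.loads(raw)
        buckets[TYPE_TO_TABLE.get(doc.get("@type"), "unrouted")].append(doc)
    return dict(buckets)
```

In an actual foreachBatch body each group would then be parsed with its own schema (e.g. via from_json) and appended to the matching table.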
On Wed, 10 Jan 2024, 17:42 PRASHANT L wrote:
> Hi
> I have a use case where I need to process json payloads coming from Kafka
> using structured
Hi
I have a use case where I need to process JSON payloads coming from Kafka
using Structured Streaming, but the thing is the JSON can have different
formats; the schema is not fixed.
Each JSON will have a @type tag, so based on the tag the JSON has to be
parsed and loaded to a table with the tag name, and if a
Yes, I agree. But apart from maintaining this state internally (in memory,
or in memory plus disk as in the case of RocksDB), on every trigger it
saves some information about this state in the checkpoint location. I'm
afraid we can't do much about this checkpointing operation. I'll continue
looking for
Hi,
You may have a point on scenario 2.
Caching streaming DataFrames: in Spark Streaming, each batch of data is
processed incrementally, and it may not fit the typical caching we
discussed. Instead, Spark Streaming has its own mechanisms to manage and
optimize the processing of streaming data. Case
I'm struggling with the following issue in Spark >=3.4, related to multiple
stateful operations.
When spark.sql.streaming.statefulOperator.allowMultiple is enabled, Spark
keeps track of two types of watermarks: eventTimeWatermarkForEviction and
eventTimeWatermarkForLateEvents. Introducing them
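The split between the two watermarks can be illustrated with a toy simulation (plain Python; the numbers and variable names are made up): late events are filtered with the watermark computed in the *previous* micro-batch, while state eviction uses the watermark advanced in the *current* one.

```python
DELAY = 10  # allowed lateness in event-time units (assumed value)

def run_batches(batches):
    """Toy model of the two watermarks: wm_late (filters late events,
    frozen at the previous batch's value) and wm_evict (evicts state,
    advanced within the current batch)."""
    wm_late = 0   # eventTimeWatermarkForLateEvents analogue
    wm_evict = 0  # eventTimeWatermarkForEviction analogue
    accepted = []
    for batch in batches:
        # Late-event filtering uses the watermark from the previous batch
        kept = [t for t in batch if t >= wm_late]
        accepted.append(kept)
        # The eviction watermark advances with this batch's max event time
        if kept:
            wm_evict = max(wm_evict, max(kept) - DELAY)
        # Only the *next* batch sees the advanced watermark for filtering
        wm_late = wm_evict
    return accepted, wm_evict
```

This also shows why results can lag one extra micro-batch behind: the watermark advanced in batch N only takes effect for filtering in batch N+1.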
Hey,
Yes, that's how I understood it (scenario 1). However, I'm not sure if
scenario 2 is possible. I think cache on a streaming DataFrame is supported
only in foreachBatch (in which case it's actually no longer a streaming DF).
Wed, 10 Jan 2024 at 15:01 Mich Talebzadeh
wrote:
> Hi,
>
> With
Hi,
With regard to your point
- Caching: Can you please explain what you mean by caching? I know that
when you have batch and streaming sources in a streaming query, then you
can try to cache batch ones to save on reads. But I'm not sure if it's what
you mean, and I don't know how to apply what
Thank you very much for your suggestions. Yes, my main concern is
checkpointing costs.
I went through your suggestions and here are my comments:
- Caching: Can you please explain what you mean by caching? I know that
when you have batch and streaming sources in a streaming query, then you
can try
All, the only documentation about the file metadata (hidden _metadata
struct) I can seem to find is on the Databricks website:
https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column
For reference, here is the struct:
_metadata: struct (nullable = false)
 |-- file_path: