Re: Enhanced Console Sink for Structured Streaming

Jungtaek Lim Mon, 05 Feb 2024 18:49:47 -0800

Maybe we could keep the default as it is, and require users to explicitly
turn on verboseMode to see the auxiliary information. I don't believe anyone
parses the output of the console sink (if someone did, this would be a
breaking change), but changes to default behavior should be made
conservatively. We can highlight the mode in the guide doc, which would be
good enough to publicize the improvement.


Other than that, the proposal looks good to me. Adding some more details
may be appropriate - e.g. what if there are multiple stateful operators,
what if there are 100 state rows in the state store, etc. One sketched idea
is to employ multiple verbosity levels: list all state store rows at full
verbosity, and otherwise print only the number of state store rows. This is
just one example of the details to work out.
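
As a rough illustration of the verbosity-level idea above, here is a minimal
Python sketch. Everything in it is an assumption made for illustration: the
level names (OFF, SUMMARY, FULL), the render_state helper, and the operator
names are hypothetical and are not part of Spark or of the proposal.

```python
from enum import Enum


class Verbosity(Enum):
    # Hypothetical levels; the thread only suggests "multiple verbosity levels".
    OFF = 0      # sink rows only
    SUMMARY = 1  # number of state rows per stateful operator
    FULL = 2     # every state row of every stateful operator


def render_state(state_by_operator, verbosity):
    """Return the lines of the state section for one micro-batch.

    state_by_operator maps a (hypothetical) operator name to its list of
    (key, value) state rows.
    """
    if verbosity is Verbosity.OFF:
        return []
    lines = []
    for operator, rows in state_by_operator.items():
        # The row count is printed at both SUMMARY and FULL verbosity.
        lines.append(f"{operator}: {len(rows)} state rows")
        if verbosity is Verbosity.FULL:
            lines.extend(f"  {key} -> {value}" for key, value in rows)
    return lines


# Two stateful operators, one of them holding 100 state rows.
state = {
    "aggregation": [(("30 seconds", "40 seconds"), 1)],
    "dedup": [(f"id-{i}", True) for i in range(100)],
}
for line in render_state(state, Verbosity.SUMMARY):
    print(line)
```

With SUMMARY, the 100-row operator costs one line of output; only FULL would
print each row.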

On Sun, Feb 4, 2024 at 3:22 AM Neil Ramaswamy
<neil.ramasw...@databricks.com.invalid> wrote:

> Re: verbosity: yes, it will be more verbose. A config I was planning to
> implement was a default-on console sink option, verboseMode, that you can
> set to be off if you just want sink data. I don't think that introduces
> additional complexity, as the last point suggests. (And also, nobody should
> be using this for "high data throughput" scenarios or
> "performance-sensitive applications". It's a development sink.)
>
> I don't think that exposing these details increases the learning curve:
> these details are *essential* for understanding how Structured Streaming
> works. I'd actually argue that it makes the learning curve shallower: by
> showing the few variables that affect the behavior of their pipelines,
> they'll have the conceptual understanding to answer essential questions
> like "why aren't my results showing up?" or "why is my state size always
> increasing?"
>
> Also: for stateless pipelines, none of this event-time and state detail
> applies. We would just render sink data—no behavior change from today. That
> seems gentle enough to me: start with stateless pipelines and see
> the output rows, but when you advance to stateful pipelines, you need to
> deal with the two complexities (event-time and state) of stateful streaming.
>
> On Sat, Feb 3, 2024 at 3:08 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> As I understood, the proposal you mentioned suggests adding event-time
>> and state store metadata to the console sink to better highlight the
>> semantics of the Structured Streaming engine. While I agree this
>> enhancement can provide valuable insights into the engine's behavior
>> especially for newcomers, there are potential challenges that we need to be
>> aware of:
>>
>> - Including additional metadata in the console sink output increases
>> the volume of information printed. The more verbose console output can
>> make it harder to tell the actual data apart from the metadata,
>> especially in scenarios with high data throughput.
>> - The added verbosity may also reduce the readability of the console
>> output, especially for users who are primarily interested in the
>> processed data and not the internal engine details.
>> - Users unfamiliar with the internal workings of Structured Streaming
>> might misinterpret the metadata as part of the actual data, leading to
>> confusion.
>> - The act of printing additional metadata to the console may introduce
>> some overhead, especially in scenarios where high-frequency updates occur.
>> While this overhead might be minimal, it is worth considering it in
>> performance-sensitive applications.
>> - While the proposal aims to make it easier for beginners to understand
>> concepts like watermarks, operator state, and output rows, it could
>> potentially increase the learning curve due to the introduction of
>> additional terminology and information.
>> - Users might benefit from the ability to selectively enable or disable
>> the display of certain metadata elements to tailor the console output to
>> their specific needs. However, this introduces additional complexity.
>>
>> As usual with these things, your mileage may vary. Whilst the proposed
>> enhancements offer valuable insights into the behavior of Structured
>> Streaming, we ought to think about the potential downsides, particularly in
>> terms of increased verbosity, complexity, and the impact on user experience.
>>
>> HTH
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
>> <neil.ramasw...@databricks.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to propose the idea of enhancing Structured Streaming's console
>>> sink to print event-time metrics and state store data, in addition to the
>>> sink's rows.
>>>
>>> I've noticed beginners often struggle to understand how watermarks,
>>> operator state, and output rows are all intertwined. By printing all of
>>> this information in the same place, I think that this sink will make it
>>> easier for users to see—and our docs to explain—how these concepts work
>>> together.
>>>
>>> For example, our docs could walk the users through a query with a
>>> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
>>> second watermark. After processing something like (foo, 17) and (bar, 15),
>>> writing another record (baz, 36) to the source would cause the following to
>>> print for batch 2:
>>>
>>> +----------------------------------------+
>>> |      WRITES TO SINK (Batch = 2)        |
>>> +--------------------------+-------------+
>>> |          window          |   count     |
>>> +--------------------------+-------------+
>>> | {10 seconds, 20 seconds} |      2      |
>>> +--------------------------+-------------+
>>> |             EVENT TIME                 |
>>> +----------------------------------------+
>>> | watermark -> 21 seconds                |
>>> | numDroppedRows -> 0                    |
>>> +----------------------------------------+
>>> |             STATE ROWS                 |
>>> +--------------------------+-------------+
>>> |           key            |    value    |
>>> +--------------------------+-------------+
>>> | {30 seconds, 40 seconds} |     {1}     |
>>> +--------------------------+-------------+
>>>
>>>
>>> From this (especially with expository help), it would be more apparent
>>> that the record at 36 seconds did three things: it advanced the watermark
>>> to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put
>>> into the state for [30, 40].
>>>
>>> One valid concern is that this sink would now be printing *metadata*,
>>> not just data: will users think that Structured Streaming writes metadata
>>> to sinks? Perhaps. But I think that we can clarify that in the
>>> documentation of the console sink.
>>>
>>> Finally, the specific behavior for handling queries with multiple
>>> stateful operations, joins, and (F)MGWS can be handled in a subsequent
>>> design discussion if the general idea is appreciated.
>>>
>>> *TLDR: I propose adding event-time and state store metadata to the
>>> console sink to better highlight the semantics of the Structured Streaming
>>> engine.*
>>>
>>> Neil
>>>
>>>
>>>
>>
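
The arithmetic in the original proposal (a record at 36 seconds moves the
watermark to 36 - 15 = 21 seconds, closes the [10, 20] window, and lands in
the state for [30, 40]) can be traced with a small plain-Python model. This
is only a sketch under stated assumptions, not how Spark implements it: the
process_batch helper is hypothetical, the watermark is applied within the
same batch that advances it, a window is treated as closed once its end is
at or below the watermark, and late-record dropping (numDroppedRows) is not
modeled.

```python
WINDOW = 10           # tumbling window length in seconds
WATERMARK_DELAY = 15  # watermark delay in seconds


def window_of(event_time):
    """Return the (start, end) tumbling window containing event_time."""
    start = (event_time // WINDOW) * WINDOW
    return (start, start + WINDOW)


def process_batch(state, max_event_time, records):
    """Apply one batch of (key, event_time) records to the window counts.

    Simplification: the new watermark is used to close windows in the same
    batch that advanced it.
    """
    for _, event_time in records:
        window = window_of(event_time)
        state[window] = state.get(window, 0) + 1
        max_event_time = max(max_event_time, event_time)
    watermark = max(0, max_event_time - WATERMARK_DELAY)
    # Windows whose end is at or below the watermark are emitted and evicted.
    emitted = {w: count for w, count in state.items() if w[1] <= watermark}
    for window in emitted:
        del state[window]
    return emitted, watermark, max_event_time


state = {}
max_seen = 0
_, _, max_seen = process_batch(state, max_seen, [("foo", 17), ("bar", 15)])
emitted, watermark, max_seen = process_batch(state, max_seen, [("baz", 36)])
print("writes to sink:", emitted)
print("watermark:", watermark)
print("state rows:", state)
```

After the second call, the emitted window, the watermark, and the remaining
state row correspond to the three sections of the proposed console output.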
