Hi all,

I'd like to propose the idea of enhancing Structured Streaming's console
sink to print event-time metrics and state store data, in addition to the
sink's rows.

I've noticed beginners often struggle to understand how watermarks,
operator state, and output rows are all intertwined. By printing all of
this information in the same place, I think that this sink will make it
easier for users to see—and our docs to explain—how these concepts work
together.

For example, our docs could walk the users through a query with a 10-second
tumbling window aggregation (e.g. with a .count()) and a 15 second
watermark. After processing something like (foo, 17) and (bar, 15), writing
another record (baz, 36) to the source would cause the following to print
for batch 2:

+----------------------------------------+

|      WRITES TO SINK (Batch = 2)        |

+--------------------------+-------------+

|          window          |   count     |

+--------------------------+-------------+

| {10 seconds, 20 seconds} |      2      |

+--------------------------+-------------+

|             EVENT TIME                 |

+----------------------------------------+

| watermark -> 21 seconds                |

| numDroppedRows -> 0                    |

+----------------------------------------+

|             STATE ROWS                 |

+--------------------------+-------------+

|           key            |    value    |

+--------------------------+-------------+

| {30 seconds, 40 seconds} |     {1}     |

+--------------------------+-------------+

>From this (especially with expository help), it would be more apparent that
the record at 36 seconds did three things: it advanced the watermark to
36-15 = 21 seconds, caused the [10, 20] window to close, and was put into
the state for [30, 40].

One valid concern is that this sink would now be printing *metadata*, not
just data: will users think that Structured Streaming writes metadata to
sinks? Perhaps. But I think that we can clarify that in the documentation
of the console sink.

Finally, the specific behavior for handling queries with multiple stateful
operations, joins, and (F)MGWS can be handled in a subsequent design
discussion if the general idea is appreciated.

*TLDR: I propose adding event-time and state store metadata to the console
sink to better highlight the semantics of the Structured Streaming engine. *

Neil

Reply via email to