Github user jose-torres commented on the issue:
https://github.com/apache/spark/pull/21617
Well, "clear" is relative. Since we're trying to provide functionality in
the Dataframe API, it's perfectly alright for the RDD graph to end up looking a
bit weird. It seems feasible to do something like:
* Have a stream reader RDD write its side output to some special shuffle partition (set of partitions?) which the main query knows not to read.
* Have a stream writer RDD with two heterogeneous sets of partitions: one to write the main query to the sink, and another to apply the specified action to the side output.
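For concreteness, here's a minimal sketch of the writer half in terms of the public RDD API, with the two heterogeneous partition sets. `MainPartition`, `SidePartition`, `writeToSink`, and `sideAction` are all hypothetical placeholders, not anything in this PR:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition types: the two heterogeneous sets described above.
case class MainPartition(index: Int) extends Partition
case class SidePartition(index: Int) extends Partition

// Sketch of a stream writer RDD. The first `numMain` partitions write the
// main query to the sink; the remaining `numSide` apply the user-specified
// side-output action. `writeToSink` and `sideAction` stand in for whatever
// the real writer would do per partition (and would need to be serializable).
class StreamWriterRDD(
    sc: SparkContext,
    numMain: Int,
    numSide: Int,
    writeToSink: Int => Unit,
    sideAction: Int => Unit)
  extends RDD[Unit](sc, Nil) {

  // Partition indices must be contiguous, so side partitions start at numMain.
  override protected def getPartitions: Array[Partition] = {
    val main: Seq[Partition] = (0 until numMain).map(MainPartition)
    val side: Seq[Partition] = (0 until numSide).map(i => SidePartition(numMain + i))
    (main ++ side).toArray
  }

  override def compute(split: Partition, context: TaskContext): Iterator[Unit] = {
    split match {
      case MainPartition(i) => writeToSink(i)
      case SidePartition(i) => sideAction(i - numMain)
    }
    Iterator.empty
  }
}
```

The reader side would be the same idea in reverse: reserve some partition indices for the side output and have the main query's shuffle reader skip them.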
I agree that watermarks should be applied immediately after the data reader - other streaming systems generally require this, and Spark does not seem to be getting any benefit from having a more general watermark concept. I haven't had time to push for this change, but I think it's known that the current Spark watermark model is flawed - I'd certainly support fixing it.
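For what it's worth, the "watermark at the source" model is already expressible with today's public API by attaching `withWatermark` directly to the reader. This is just an illustrative sketch using the built-in rate source (the interval strings are arbitrary), not anything this PR changes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("watermark-at-source").getOrCreate()
import spark.implicits._

// Attach the watermark immediately after the data reader, before any other
// operator - the model argued for above.
val events = spark.readStream
  .format("rate")                  // built-in test source: (timestamp, value)
  .option("rowsPerSecond", "10")
  .load()
  .withWatermark("timestamp", "10 seconds")

// Every downstream stateful operator then sees a watermark defined at the source.
val counts = events
  .groupBy(window($"timestamp", "1 minute"))
  .count()
```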