Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21617
@jose-torres
Yes, you're right. They would be the rows after other transformations and filtering have been applied, not the original input rows. I just haven't found a better term than "input row", since from the state operator's point of view they are input rows.
Btw, as I described in the JIRA, my final goal is to push late events to a side output (as Beam and Flink do), but I'm stuck on a couple of concerns (please correct me anytime if I'm missing something here):
1. Which events to push?
A query can apply a couple of transformations before rows reach the stateful operator and get filtered out by the watermark. This is not ideal, and I guess that's what you meant by "aren't necessarily the input rows".
Ideally we would provide the original input rows rather than transformed ones, but then we would have to put a major restriction on the watermark: the `Filter with watermark` would have to be applied in the data reader (or in a filter placed just after the data reader), which means the input rows themselves would need to carry a timestamp field.
We couldn't apply transformation(s) to populate/manipulate the timestamp field, and the timestamp field **must not** be modified by later transformations. For example, Flink provides a timestamp assigner to extract the timestamp value from the input stream, and the reserved field name `rowtime` is used for the timestamp field.
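To make the restriction concrete, here is a minimal Structured Streaming sketch using the built-in `rate` test source (whose rows already carry a `timestamp` column, playing the role Flink's `rowtime` plays). The watermark is defined directly on the source-provided column and nothing downstream rewrites it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WatermarkAtSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("watermark-at-source").getOrCreate()
    import spark.implicits._

    // The `rate` test source emits a `timestamp` column out of the box,
    // i.e. "input rows themselves have a timestamp field".
    val events = spark.readStream.format("rate").load()

    // Watermark defined on the source-provided column; later transformations
    // must not modify `timestamp`, otherwise the rows dropped as late would
    // no longer correspond to the original input rows.
    val counts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

    counts.writeStream.format("console").outputMode("update").start().awaitTermination()
  }
}
```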
2. Does the nature of RDD support multiple outputs?
I have been struggling with this, but as far as I understand, an RDD by its nature doesn't support multiple outputs. To me this looks like a major difference between the pull model and the push model: in the push model, which other streaming frameworks use, defining another output stream is really straightforward, just like adding a remote listener, whereas I'm not sure how it could be cleanly defined in the pull model. I also googled multiple outputs on RDDs (since someone may have struggled with this before), but no luck.
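For illustration, the closest I can get within the RDD model is the usual two-pass workaround sketched below (`splitByLateness` and `isLate` are hypothetical names, with `isLate` standing in for the watermark check): the "side output" is just a second traversal of the same lineage, not a second output of one computation.

```scala
import org.apache.spark.rdd.RDD

// A sketch, not a proposal: split one RDD into on-time and late rows.
// `isLate` is a hypothetical stand-in for the watermark check.
def splitByLateness[T](rdd: RDD[T])(isLate: T => Boolean): (RDD[T], RDD[T]) = {
  // An RDD computation has a single output, so the "side output" requires
  // filtering the same lineage twice; cache to avoid recomputing upstream
  // stages on the second pass.
  val cached = rdd.cache()
  (cached.filter(t => !isLate(t)), cached.filter(isLate))
}
```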
The alternative approaches I can imagine are all workarounds: RPC, a listener bus, or a callback function (a sketch of the callback flavor follows below). None of them can define another stream within the current DAG, and I'm also not sure we could create a DataFrame from that data and let end users compose another query against it.
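For example, the callback workaround might look like the sketch below (`filterWithCallback` and `onLate` are hypothetical names, not anything existing in Spark). It also shows why it falls short: the callback runs as a side effect on the executors, so the late rows never become a stream the driver-side DAG, or the end user, can query.

```scala
import org.apache.spark.rdd.RDD

// A sketch of the "callback function" workaround. `onLate` must be
// serializable because it executes inside executors, and its side effect
// is invisible to the DAG: late rows don't form another output stream.
def filterWithCallback[T](rdd: RDD[T])(isLate: T => Boolean)(onLate: T => Unit): RDD[T] =
  rdd.mapPartitions { iter =>
    iter.filter { t =>
      if (isLate(t)) { onLate(t); false } else true
    }
  }
```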
It would be really helpful if you could think of better alternatives and share them.