Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21617
@jose-torres
Yes, you're right. They would be the rows after other transformations and filtering have been applied, not the original input rows. I just haven't found a better term than "input row", since from the state operator's point of view they are input rows.
Btw, as I described in the JIRA, my final goal is to push late events to a side output (as Beam and Flink do), but I'm stuck on a couple of concerns (please correct me anytime if I'm missing something here):
1. Which events to push?
A query can apply a couple of transformations before rows reach the stateful operator and get filtered out by the watermark. This is not ideal, and I guess that's what you meant by "aren't necessarily the input rows".
Ideally we would provide the original input rows rather than transformed ones, but then we would have to put a major restriction on the watermark: the `Filter with watermark` would have to be applied in the data reader (or in a filter placed just after the data reader), which means the input rows themselves would need to carry a timestamp field.
We couldn't apply transformation(s) to populate/manipulate the timestamp field, and the timestamp field **must not** be modified by later transformations. For example, Flink provides a timestamp assigner to extract the timestamp value from the input stream, and the reserved field name `rowtime` is used for the timestamp field.
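To make the restriction concrete, here is a minimal Structured Streaming sketch using the built-in `rate` test source (whose rows already carry a `timestamp` column, playing the role Flink's `rowtime` plays). The watermark is defined directly on the source-provided column and nothing downstream rewrites it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WatermarkAtSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("watermark-at-source").getOrCreate()
    import spark.implicits._

    // The `rate` test source emits a `timestamp` column out of the box,
    // i.e. "input rows themselves have a timestamp field".
    val events = spark.readStream.format("rate").load()

    // Watermark defined on the source-provided column; later transformations
    // must not modify `timestamp`, otherwise the rows dropped as late would
    // no longer correspond to the original input rows.
    val counts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

    counts.writeStream.format("console").outputMode("update").start().awaitTermination()
  }
}
```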
2. Does the nature of RDD support multiple outputs?
I have been struggling with this, but as far as I understand, an RDD by its nature doesn't support multiple outputs. To me this looks like a major difference between the pull model and the push model: in the push model, which other streaming frameworks use, defining another output stream is really straightforward, just like adding a remote listener, whereas I'm not sure how it could be cleanly defined in the pull model. I also googled multiple outputs on RDDs (since someone may have struggled with this before), but no luck.
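For illustration, the closest I can get within the RDD model is the usual two-pass workaround sketched below (`splitByLateness` and `isLate` are hypothetical names, with `isLate` standing in for the watermark check): the "side output" is just a second traversal of the same lineage, not a second output of one computation.

```scala
import org.apache.spark.rdd.RDD

// A sketch, not a proposal: split one RDD into on-time and late rows.
// `isLate` is a hypothetical stand-in for the watermark check.
def splitByLateness[T](rdd: RDD[T])(isLate: T => Boolean): (RDD[T], RDD[T]) = {
  // An RDD computation has a single output, so the "side output" requires
  // filtering the same lineage twice; cache to avoid recomputing upstream
  // stages on the second pass.
  val cached = rdd.cache()
  (cached.filter(t => !isLate(t)), cached.filter(isLate))
}
```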
The alternative approaches I can imagine are all workarounds: RPC, a listener bus, or a callback function (a sketch of the callback flavor follows below). None of them can define another stream within the current DAG, and I'm also not sure we could create a DataFrame from that data and let end users compose another query against it.
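For example, the callback workaround might look like the sketch below (`filterWithCallback` and `onLate` are hypothetical names, not anything existing in Spark). It also shows why it falls short: the callback runs as a side effect on the executors, so the late rows never become a stream the driver-side DAG, or the end user, can query.

```scala
import org.apache.spark.rdd.RDD

// A sketch of the "callback function" workaround. `onLate` must be
// serializable because it executes inside executors, and its side effect
// is invisible to the DAG: late rows don't form another output stream.
def filterWithCallback[T](rdd: RDD[T])(isLate: T => Boolean)(onLate: T => Unit): RDD[T] =
  rdd.mapPartitions { iter =>
    iter.filter { t =>
      if (isLate(t)) { onLate(t); false } else true
    }
  }
```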
It would be really helpful if you could think of better alternatives and share them.