[ 
https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386568#comment-14386568
 ] 

Sean Owen commented on SPARK-6605:
----------------------------------

Thanks, that's very useful. I think the behavior is expected, but it's not 
obvious. I assume you are printing the RDD from a window with no data.

Both versions give the same answer, in that both show a count of 0 (or no 
count, which means the same thing) for every key. The second example just has 
an explicit 0 for two keys instead of an implicit 0 for all of them. The more 
expected answer is the first one: no results. The first version gets exactly 
that because it re-counts the whole window, which has no data.

The second one is the result of the optimization offered by invFunc. It 
correctly finds the count is 0 in the current window for these two keys, but it 
has no notion that a count of 0 is the same as no value at all. You and I know 
that, and you could simply apply a filter() to remove these redundant entries 
if desired.
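
To illustrate the difference, here is a minimal sketch in plain Python (no 
Spark involved) of the two strategies over a window of two batches. The helper 
names recount and incremental are made up for this example and are not Spark 
APIs; they only mimic re-counting the window versus updating the previous 
result with an inverse function.

```python
from collections import Counter

def recount(window_batches):
    """Recompute counts from every batch currently in the window."""
    counts = Counter()
    for batch in window_batches:
        counts.update(batch)
    return dict(counts)

def incremental(prev_counts, new_batch, old_batch):
    """Update the previous window's counts: add the entering batch,
    then subtract the leaving batch (the role invFunc plays)."""
    counts = dict(prev_counts)
    for word in new_batch:
        counts[word] = counts.get(word, 0) + 1
    for word in old_batch:
        counts[word] = counts.get(word, 0) - 1
    return counts

# Hypothetical input: data arrives only in the first batch.
batches = [["a", "b", "a"], [], [], []]

prev = recount(batches[0:2])   # window over batches 0 and 1
full = recount(batches[1:3])   # window over batches 1 and 2, re-counted
incr = incremental(prev, new_batch=batches[2], old_batch=batches[0])

# Dropping the explicit zeros makes the two results agree.
filtered = {word: n for word, n in incr.items() if n != 0}
```

Here full has no entries, incr keeps both keys with a count of 0, and 
filtered matches full again.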

I'm not sure it's "fixable" in general unless the user can supply something 
like a {{(V,V) => Option[V]}} as the {{invFunc}} instead. But it's not really 
giving the wrong answer either.

> Same transformation in DStream leads to different result
> --------------------------------------------------------
>
>                 Key: SPARK-6605
>                 URL: https://issues.apache.org/jira/browse/SPARK-6605
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: SaintBacchus
>             Fix For: 1.4.0
>
>
> The transformation *reduceByKeyAndWindow* has two implementations: one uses 
> *WindowedDStream* and the other uses *ReducedWindowedDStream*.
> The results are always the same, except when an empty window occurs.
> Taking word count as an example, if no data arrives for a period longer than 
> the window duration, the first *reduceByKeyAndWindow* returns no elements, 
> while the second returns many elements with a value of zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
