[ https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-12353:
------------------------------
    Assignee: Saisai Shao

> wrong output for countByValue and countByValueAndWindow
> -------------------------------------------------------
>
>                 Key: SPARK-12353
>                 URL: https://issues.apache.org/jira/browse/SPARK-12353
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, Input/Output, PySpark, Streaming
>    Affects Versions: 1.5.2
>         Environment: Ubuntu 14.04, Python 2.7.6
>            Reporter: Bo Jin
>            Assignee: Saisai Shao
>              Labels: releasenotes
>             Fix For: 2.0.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> http://stackoverflow.com/q/34114585/4698425
> In PySpark Streaming, the functions countByValue and countByValueAndWindow return a single number (the count of distinct elements) instead of a DStream of (k, v) pairs.
> This is inconsistent with the documentation:
> countByValue: When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
> countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
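To make the discrepancy concrete, the sketch below contrasts the two behaviors in plain Python (not PySpark): what the documentation promises per batch (a mapping from each value to its frequency) versus the single distinct-element count the report describes for PySpark 1.5.2. The sample batch is hypothetical and only illustrates the semantics.

```python
from collections import Counter

# A hypothetical batch of elements, standing in for one RDD of a DStream.
batch = ["a", "b", "a", "c", "b", "a"]

# Documented countByValue semantics: (K, Long) pairs where each key's
# value is its frequency in the batch.
documented = sorted(Counter(batch).items())
print(documented)  # [('a', 3), ('b', 2), ('c', 1)]

# Behavior reported for PySpark 1.5.2: a single number, the count of
# distinct elements in the batch.
buggy = len(set(batch))
print(buggy)  # 3
```

The fix tracked by this issue (shipped in 2.0.0) aligns the PySpark implementation with the documented (K, Long)-pair semantics.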