GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2091
[SPARK-2871] [PySpark] add histgram() API
Compute a histogram using the provided buckets. The buckets
are all open to the right except for the last which is closed.
e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
and 50 we would have a histogram of 1,0,1.
If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
this can be switched from an O(log n) inseration to O(1) per
element(where n = # buckets), if you set `even` to True.
Buckets must be sorted and not contain any duplicates, must be
at least two elements.
If `buckets` is a number, it will generates buckets which is
evenly spaced between the minimum and maximum of the RDD. For
example, if the min value is 0 and the max is 100, given buckets
as 2, the resulting buckets will be [0,50) [50,100]. buckets must
be at least 1 If the RDD contains infinity, NaN throws an exception
If the elements in RDD do not vary (max == min) always returns
a single bucket.
It will return an tuple of buckets and histogram.
>>> rdd = sc.parallelize(range(51))
>>> rdd.histogram(2)
([0, 25, 50], [25, 26])
>>> rdd.histogram([0, 5, 25, 50])
([0, 5, 25, 50], [5, 20, 26])
>>> rdd.histogram([0, 15, 30, 45, 60], True)
([0, 15, 30, 45, 60], [15, 15, 15, 6])
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark histgram
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2091.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2091
----
commit 0e18a2d3e98ca940320153b192fa03306cfd8f9a
Author: Davies Liu <[email protected]>
Date: 2014-08-22T04:33:38Z
add histgram() API
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]