GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/15637
[SPARK-18000] [SQL] Aggregation function for computing endpoints for histograms
## What changes were proposed in this pull request?
For a given column, this function returns the bins of an equi-width
histogram, as (distinct value, frequency) pairs, when the number of distinct
values is less than or equal to the specified maximum number of bins.
Otherwise, for a column of string type, it returns an empty map; for a
column of numeric type, it returns the endpoints of an equi-height
histogram, i.e. the approximate percentiles at percentages
0.0, 1/numBins, 2/numBins, ..., (numBins-1)/numBins, 1.0.
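The branching described above can be illustrated with a small standalone
sketch. This is not the code in this pull request: `histogram_endpoints` is
a hypothetical pure-Python stand-in, it computes exact nearest-rank
percentiles where the patch uses approximate ones, and both the return
shapes (a dict for bins, a list for endpoints) and the skipping of nulls are
assumptions made only for illustration.

```python
from collections import Counter


def histogram_endpoints(values, num_bins):
    """Hypothetical stand-in for the described behavior (not Spark's code)."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")

    # Assumption: nulls are ignored; the description does not say.
    data = [v for v in values if v is not None]
    counts = Counter(data)

    # Few enough distinct values: return (distinct value, frequency) pairs.
    if len(counts) <= num_bins:
        return dict(counts)

    # Too many distinct values in a string column: return an empty map.
    if all(isinstance(v, str) for v in data):
        return {}

    # Numeric column: endpoints at percentages 0, 1/num_bins, ..., 1.
    # Exact nearest-rank percentiles are used here; the patch itself
    # relies on approximate percentiles.
    ordered = sorted(data)
    n = len(ordered)
    endpoints = []
    for i in range(num_bins + 1):
        # Integer ceiling of (i / num_bins) * n, clamped to rank 1.
        rank = max(1, (i * n + num_bins - 1) // num_bins)
        endpoints.append(ordered[rank - 1])
    return endpoints
```

With this sketch, a column holding "a", "b", "a" and a maximum of 2 bins
yields the pairs a -> 2 and b -> 1, while the integers 1 through 10 with 2
bins yield three endpoints, one for each of the percentages 0.0, 0.5 and 1.0.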
## How was this patch tested?
Added test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark histogramEndpoints
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15637.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15637
----
commit 1580a7fb0a6a21b603e754c744c8c6cb02fd24c2
Author: wangzhenhua <[email protected]>
Date: 2016-10-12T01:02:37Z
add agg function to generate string histogram
commit a3281606372f83eca960ea90734e8ee9cb1c3125
Author: wangzhenhua <[email protected]>
Date: 2016-10-21T08:10:01Z
comments
commit 907cd99b8b26ae3caa224df67cc10bc784f10fb4
Author: Zhenhua Wang <[email protected]>
Date: 2016-10-22T12:15:22Z
create HistogramEndpoints to generate endpoints for string and numeric types
commit 35e453cb1079398196ece4f13e8f294ee4e4e916
Author: Zhenhua Wang <[email protected]>
Date: 2016-10-23T15:05:09Z
change suite names
commit f6fe25de3f5f1382727cecfdda7b74e40758896b
Author: wangzhenhua <[email protected]>
Date: 2016-10-26T03:20:05Z
test cases and fix bugs
commit 15eb3721f56ac27bd90933ef7e66f3453eae4a75
Author: wangzhenhua <[email protected]>
Date: 2016-10-26T03:29:14Z
fix doc
----