[
https://issues.apache.org/jira/browse/METRON-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768215#comment-15768215
]
ASF GitHub Bot commented on METRON-637:
---------------------------------------
Github user mattf-horton commented on a diff in the pull request:
https://github.com/apache/incubator-metron/pull/401#discussion_r93522344
--- Diff: metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java ---
@@ -425,4 +428,74 @@ public Object apply(List<Object> args) {
return result;
}
}
+
+ /**
+ * Calculates the statistical bin that a value falls in.
+ */
+ @Stellar(namespace = "STATS", name = "BIN"
+ , description = "Computes the bin that the value is in based on the statistical distribution."
+ , params = {
+ "stats - The Stellar statistics object"
+ , "value - The value to bin"
+ , "range? - A list of percentile bin ranges (excluding min and max) or a string representing a known and common set of bins. " +
+ "For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg." +
+ " If this argument is omitted, then we assume a Quartile bin split."
+ }
+ , returns = "Which bin the value falls in such that bin < value < bin + 1"
+ )
+ public static class Bin extends BaseStellarFunction {
+ public enum BinSplits {
+ QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
+ QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
+ DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0))
+ ;
+ public final List<Double> split;
+ BinSplits(List<Double> split) {
+ this.split = split;
+ }
+
+ public static List<Double> getSplit(Object o) {
+ if(o instanceof String) {
+ return BinSplits.valueOf((String)o).split;
+ }
+ else if(o instanceof List) {
+ List<Double> ret = new ArrayList<>();
+ for(Object valO : (List<Object>)o) {
+ ret.add(ConversionUtils.convert(valO, Double.class));
+ }
+ return ret;
+ }
+ throw new IllegalStateException("The split you tried to pass is not a valid split: " + o.toString());
+ }
+ }
+
+
+ @Override
+ public Object apply(List<Object> args) {
+ StatisticsProvider stats = convert(args.get(0), StatisticsProvider.class);
+ Double value = convert(args.get(1), Double.class);
+ List<Double> bins = BinSplits.QUARTILE.split;
+ if (args.size() > 2) {
+ bins = BinSplits.getSplit(args.get(2));
+ }
+ if (stats == null || value == null || bins.size() == 0) {
+ return -1;
+ }
+
+ double prevPctile = stats.getPercentile(bins.get(0));
+
+ if(value <= prevPctile) {
+ return 0;
+ }
+ for(int bin = 1; bin < bins.size();++bin) {
+ double pctile = stats.getPercentile(bins.get(bin));
+ if(value > prevPctile && value <= pctile) {
--- End diff --
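Really no need to check the lower bound here, is there? :-)
By the time the loop reaches a given bin, the earlier comparisons have already ruled out value <= prevPctile, so the "value > prevPctile" test is always true.
Thus, you can drop "prevPctile" and start the loop at bin = 0, which also absorbs the special case for the first bin.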
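A minimal sketch of the simplified loop, for illustration only. The names BinSketch, binOf, and percentileOf are hypothetical; percentileOf stands in for stats.getPercentile(...), and the return value for a value above the last cutoff is an assumption, since the diff is cut off before that point.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

class BinSketch {
  /**
   * Returns the index of the first bin whose upper cutoff is >= value.
   * Only the upper bound is checked: if the loop reaches bin i, every
   * earlier cutoff has already been found to be < value.
   * percentileOf is a hypothetical stand-in for stats.getPercentile(...).
   */
  static int binOf(double value, List<Double> splits, DoubleUnaryOperator percentileOf) {
    for (int bin = 0; bin < splits.size(); ++bin) {
      if (value <= percentileOf.applyAsDouble(splits.get(bin))) {
        return bin;
      }
    }
    // Assumption: a value above the last cutoff lands in the final bin.
    return splits.size();
  }

  public static void main(String[] args) {
    // Toy distribution in which the value at percentile p is simply p.
    DoubleUnaryOperator identity = p -> p;
    List<Double> quartiles = Arrays.asList(25.0, 50.0, 75.0);

    System.out.println(binOf(10.0, quartiles, identity)); // 0
    System.out.println(binOf(99.0, quartiles, identity)); // 3
    // The example from the issue description: percentile 27 with bins 25, 75, 95.
    System.out.println(binOf(27.0, Arrays.asList(25.0, 75.0, 95.0), identity)); // 1
  }
}
```

This relies on the cutoffs being passed in ascending order, as they are for QUARTILE, QUINTILE, and DECILE; a user-supplied list would need the same property.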
> Add a STATS_BIN function to Stellar.
> ------------------------------------
>
> Key: METRON-637
> URL: https://issues.apache.org/jira/browse/METRON-637
> Project: Metron
> Issue Type: Improvement
> Reporter: Casey Stella
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> When passing parameters to models, it's often useful to pass the binned
> representation of a variable based on an empirical statistical distribution,
> rather than the actual variable. This function should accept a set of
> percentile bins, a statistical sketch, and a value. It should return the
> index of the bin in which the percentile of the value falls.
> For instance, consider the value 17, which is at percentile 27. If we use 25, 75,
> 95 to define our bins, this function would return 1, because its percentile,
> 27, is between 25 and 75.