[
https://issues.apache.org/jira/browse/METRON-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15769421#comment-15769421
]
Matt Foley commented on METRON-637:
-----------------------------------
This is a really nice new feature. It works and is clean. It rates a +1. But I
think we are likely to be processing thousands or perhaps millions of data
points at any given time, so re-parsing the bounds list on every call is
troublesome. Also, for the STATS_BIN function, the percentile bounds list should
be transformed once into a value bounds list, rather than applying the
boundFunc many times to (on average) half the bounds in the list, per input
value.
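
To make the "transform once" point concrete, here is a minimal Python sketch
(not Metron or Stellar code). It assumes a sorted list of samples stands in for
the statistical sketch, uses a nearest-rank percentile, and puts a value that
equals a bound into the upper bin; all function names are hypothetical.

```python
import math
from bisect import bisect_right


def percentile_value(sorted_sample, pct):
    """Nearest-rank percentile of a sorted sample (stand-in for a sketch query)."""
    if not sorted_sample:
        raise ValueError("sample must not be empty")
    n = len(sorted_sample)
    # Smallest rank whose share of the data is at least pct percent, clamped to [1, n].
    rank = min(n, max(1, math.ceil(pct * n / 100)))
    return sorted_sample[rank - 1]


def to_value_bounds(sorted_sample, percentile_bounds):
    """Done once: convert percentile bounds into value bounds."""
    return [percentile_value(sorted_sample, p) for p in percentile_bounds]


def bin_index(value_bounds, value):
    """Done per input value: binary search, no percentile lookups at all.

    Assumption: a value equal to a bound falls into the upper bin.
    """
    return bisect_right(value_bounds, value)


# Toy data: the values 1..100, so each value sits at its own percentile.
sample = list(range(1, 101))
value_bounds = to_value_bounds(sample, [25, 75, 95])
first_bin = bin_index(value_bounds, 27)
```

With this toy sample the value bounds come out as 25, 75 and 95, and 27 lands
in bin 1. The per-value cost is a binary search over the precomputed list,
instead of a percentile lookup for each bound that gets visited.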
Unlike the Profiler, there is no config caching between calls to the Stellar
Function, because this "configuration" is done every time, in-line, rather than
in ZK. But we don't want all the complexity of ZK just for this little binning
function. And we want multiple different binning functions to be in use at the
same time, without needing complex scope management.
One solution would be to treat it the way Python treats regexes, and provide a
compiler function. What if we have
COMPILE_BIN(bounds) and COMPILE_STATS_BIN(stats, bounds)?
Each would return an opaque key (or an integer) that references a cached,
pre-parsed setup; it can be thread-safe because it would be read-only. Then we
would invoke with
BIN(key, value) and STATS_BIN(key, value)
Normally I would say this is an optimization and we should leave it for later.
But then we would be stuck with the inefficient, non-compiled form of the BIN
and STATS_BIN functions.
Your call, @cestella. I don't want to get in the way of progress; I just feel
obligated to bring it up.
--Matt
> Add a STATS_BIN function to Stellar.
> ------------------------------------
>
> Key: METRON-637
> URL: https://issues.apache.org/jira/browse/METRON-637
> Project: Metron
> Issue Type: Improvement
> Reporter: Casey Stella
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> When passing parameters to models, it's often useful to pass the binned
> representation of a variable based on an empirical statistical distribution,
> rather than the actual variable. This function should accept a set of
> percentile bins and a statistical sketch and a value. It should return the
> index where the percentile of the value falls.
> For instance, consider the value 17, which is at percentile 27. If we use
> 25, 75, 95 to define our bins, this function would return 1, because its
> percentile, 27, is between 25 and 75.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)