[ 
https://issues.apache.org/jira/browse/METRON-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15769421#comment-15769421
 ] 

Matt Foley commented on METRON-637:
-----------------------------------

This is a really nice new feature. It works and is clean. It rates a +1. But I 
think we are likely to be processing thousands or perhaps millions of data 
points at any given time, so re-parsing the bounds list on every call is 
troublesome. Also, for the STATS_BIN function, the percentile bounds list 
should be transformed once into a value bounds list, rather than applying 
boundFunc to (on average) half the bounds in the list for each input value.
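
To make the second point concrete, here is a rough sketch of converting the 
percentile bounds to value bounds once and then binning each input with a 
binary search. It is not the Metron code: a sorted sample stands in for the 
statistical sketch, value_at_percentile is a hypothetical stand-in for 
boundFunc, and sending a value equal to a bound to the higher bin is an 
assumption.

```python
import math
from bisect import bisect_right


def value_at_percentile(sorted_sample, pct):
    """Nearest-rank percentile of a sorted sample.

    Hypothetical stand-in for boundFunc applied to a statistical sketch.
    """
    if not sorted_sample:
        raise ValueError("sample must not be empty")
    # Multiply before dividing to keep the rank exact for whole percentiles.
    rank = max(1, math.ceil(pct * len(sorted_sample) / 100.0))
    return sorted_sample[rank - 1]


def to_value_bounds(sorted_sample, percentile_bounds):
    """Done once: map each percentile bound to the value at that percentile."""
    return [value_at_percentile(sorted_sample, p) for p in percentile_bounds]


def bin_index(value_bounds, value):
    """Done per input: O(log n) lookup, with no percentile computation.

    Assumption: a value equal to a bound falls in the higher bin.
    """
    return bisect_right(value_bounds, value)


sample = list(range(1, 101))  # toy data: values 1..100
value_bounds = to_value_bounds(sample, [25, 75, 95])  # computed once
indices = [bin_index(value_bounds, v) for v in (10, 27, 80, 99)]
```

With this split, the per-value cost no longer depends on how expensive the 
percentile lookup on the sketch is.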

Unlike the Profiler, there is no config caching between calls to the Stellar 
function, because this "configuration" is supplied in-line on every call 
rather than stored in ZK. But we don't want all the complexity of ZK just for 
this little binning function, and we want multiple different binning functions 
to be in use at the same time without needing complex scope management.

One solution would be to treat it the way Python treats regexes, and provide a 
compiler function. What if we have
    COMPILE_BIN(bounds) and COMPILE_STATS_BIN(stats, bounds)
Each would return an opaque key (or an integer) that references a cached, 
pre-parsed setup; it can be thread-safe because the setup would be read-only. 
Then we would invoke with
    BIN(key, value) and STATS_BIN(key, value)
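
A minimal sketch of the compile-then-invoke idea, under stated assumptions: 
BinRegistry, compile_bin and bin are hypothetical names used only for 
illustration, not Stellar or Metron APIs, and the key is a plain integer. 
Bounds are parsed and validated once, stored as an immutable tuple, and every 
later call is a lookup plus a binary search.

```python
import threading
from bisect import bisect_right


class BinRegistry:
    """Hypothetical cache of pre-parsed bounds, addressed by an integer key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._key_by_bounds = {}  # parsed bounds tuple -> key
        self._bounds_by_key = []  # key -> parsed bounds tuple (append-only)

    def compile_bin(self, bounds):
        """Parse and validate the bounds once; return a reusable key."""
        parsed = tuple(float(b) for b in bounds)
        if any(a > b for a, b in zip(parsed, parsed[1:])):
            raise ValueError("bounds must be sorted in ascending order")
        with self._lock:  # only registration mutates shared state
            key = self._key_by_bounds.get(parsed)
            if key is None:
                key = len(self._bounds_by_key)
                self._bounds_by_key.append(parsed)
                self._key_by_bounds[parsed] = key
            return key

    def bin(self, key, value):
        """Per-value call: the stored tuple is never modified after insert.

        Assumption: a value equal to a bound falls in the higher bin.
        """
        return bisect_right(self._bounds_by_key[key], value)


registry = BinRegistry()
key = registry.compile_bin([25, 75, 95])   # parse once
same_key = registry.compile_bin([25, 75, 95])
other_key = registry.compile_bin([10, 20])  # a second binning, in use at once
index = registry.bin(key, 27)
```

Because each key names its own immutable entry, several binning setups can be 
live at the same time without any scope management. A STATS_BIN variant could 
presumably cache the derived value bounds in the same way.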

Normally I would say this is an optimization and we should leave it for later. 
But then we would be stuck with the inefficient, non-compiled form of the BIN 
and STATS_BIN functions.

Your call, @cestella . I don't want to get in the way of progress, I just feel 
obligated to bring it up.
--Matt

> Add a STATS_BIN function to Stellar.
> ------------------------------------
>
>                 Key: METRON-637
>                 URL: https://issues.apache.org/jira/browse/METRON-637
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Casey Stella
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When passing parameters to models, it's often useful to pass the binned 
> representation of a variable based on an empirical statistical distribution, 
> rather than the actual variable.  This function should accept a set of 
> percentile bins and a statistical sketch and a value.  It should return the 
> index where the percentile of the value falls.
> For instance, consider the value 17, whose percentile is 27.  If we use 25, 
> 75, and 95 to define our bins, this function would return 1, because the 
> value's percentile, 27, is between 25 and 75.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
