[ 
https://issues.apache.org/jira/browse/METRON-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770140#comment-15770140
 ] 

ASF GitHub Bot commented on METRON-637:
---------------------------------------

Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/401
  
    Ok, so a couple of things.  I ran a quick perf test on this function as it 
stands.  On my four-year-old MacBook, I ran the `STATS_BIN` function a million 
times with random values and it took ~5.5s (roughly 5.5 microseconds per call).  
Even at a throughput of 1M messages per second, if we assume that the messages 
are spread across the cluster, I think this keeps up.
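
A back-of-the-envelope sketch of that arithmetic, in Python for brevity. It is not the perf test mentioned in this comment. It assumes one `STATS_BIN` call per message and a single-threaded run; neither is stated above, and the function names are purely illustrative.

```python
def per_call_seconds(total_seconds, calls):
    """Average wall-clock cost of one call, given a timed batch of calls."""
    return total_seconds / calls


def cores_needed(per_call, messages_per_second):
    """CPU-seconds required per second of traffic.

    Assumes one call per message and perfect scaling across cores.
    """
    return per_call * messages_per_second


# Figures quoted in the comment: 1M calls in about 5.5 s, target of 1M msgs/s.
cost = per_call_seconds(5.5, 1_000_000)
total_cores = cores_needed(cost, 1_000_000)
print(f"{cost * 1e6:.1f} microseconds per call, about {total_cores:.1f} cores in total")
```

Under those assumptions the whole cluster would need roughly 5.5 cores' worth of CPU for this function at 1M messages per second, which is a small load once it is spread across nodes.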
    
    Now, *EVEN GIVEN THIS*, normally I would go through the effort here of 
adding a caching layer to save us the computation of the percentile, but it's 
actually quite difficult to figure out when two `StatisticsProvider` objects 
are equivalent without resorting to actually computing the percentiles.  I 
could do things like use non-computed attributes (number of data points, sum of 
data points, average), but the former Math grad student in me was uncomfortable 
with that.  It's just very hard to be absolutely sure that two objects couldn't 
have all of those attributes the same and still be different distributions.
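
A small Python illustration of that concern, using plain lists rather than `StatisticsProvider` objects. The helper names and the nearest-rank percentile definition are assumptions made for this example only.

```python
import math


def summary(values):
    """Cheap attributes: (count, sum, mean)."""
    return (len(values), sum(values), sum(values) / len(values))


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


a = [0, 0, 10, 10]
b = [5, 5, 5, 5]

# Identical count, sum and mean...
print(summary(a) == summary(b))
# ...but the 25th and 75th percentiles differ.
print(percentile(a, 25), percentile(b, 25))
print(percentile(a, 75), percentile(b, 75))
```

A cache keyed only on count, sum and mean would treat `a` and `b` as the same distribution and could return the wrong bin for one of them.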
    
    I think, given these things together, that I'm going to recommend that we 
cross the caching bridge when we come to it.  Now, it won't take much to 
convince me to go the other direction, so if you (or anyone else, really ;) 
feel strongly @mattf-horton , I'll go ahead and tackle that dude as best I can.
    
    I'm going to address the rest of your comments and check in the performance 
test I ran, so you can see that I have nothing up my sleeves and so it can be 
run periodically.


> Add a STATS_BIN function to Stellar.
> ------------------------------------
>
>                 Key: METRON-637
>                 URL: https://issues.apache.org/jira/browse/METRON-637
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Casey Stella
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When passing parameters to models, it's often useful to pass the binned 
> representation of a variable based on an empirical statistical distribution, 
> rather than the actual variable.  This function should accept a set of 
> percentile bins and a statistical sketch and a value.  It should return the 
> index where the percentile of the value falls.
> For instance, consider the value 17, which is at percentile 27.  If we use 
> 25, 75, 95 to define our bins, this function would return 1, because its 
> percentile, 27, is between 25 and 75.
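
The binning rule in the description can be sketched in Python as follows. This is not the Stellar implementation: a sorted list stands in for the statistical sketch, the function names are hypothetical, and the handling of a percentile that lands exactly on a bin boundary (it goes to the higher bin here) is an assumption.

```python
from bisect import bisect_right


def percentile_of(sorted_samples, value):
    """Percent of samples <= value; a stand-in for querying a sketch."""
    return 100.0 * bisect_right(sorted_samples, value) / len(sorted_samples)


def bin_index(percentile, bounds):
    """Index of the bin that `percentile` falls into, for ascending bounds."""
    return bisect_right(bounds, percentile)


def stats_bin(sorted_samples, value, bounds):
    """Bin index of `value` relative to the empirical distribution."""
    return bin_index(percentile_of(sorted_samples, value), bounds)


bounds = [25, 75, 95]
# Example from the issue: percentile 27 lies between 25 and 75, so bin 1.
print(bin_index(27, bounds))
# Toy distribution 1..100, where the value 27 sits at percentile 27.
print(stats_bin(list(range(1, 101)), 27, bounds))
```

With three bounds there are four possible results, 0 through 3: below 25, between 25 and 75, between 75 and 95, and 95 or above.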



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
