Agreed, plain rounding only makes sense if the values are roughly evenly distributed; in my case they were in [0, 1]. Yes, I will put it on my to-do list to look at. Thanks for the confirmation.
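
To make the two options in this thread concrete, here is a minimal sketch in plain Python, not Spark code and not the BinaryClassificationMetrics implementation. It compares rounding scores into equal-width bins with binning on quantiles estimated from a random sample (the sampling step is my assumption, loosely in the spirit of the RangePartitioner comparison in the quoted message). The function names, the sample size, and the synthetic skewed scores are all hypothetical and only for illustration.

```python
import bisect
import random


def round_to_bins(scores, num_bins):
    """Equal-width binning: snap each score in [0, 1] to one of num_bins values."""
    return [min(int(s * num_bins), num_bins - 1) / num_bins for s in scores]


def quantile_edges(scores, num_bins, sample_size=10000, seed=0):
    """Estimate num_bins - 1 cut points from the quantiles of a random sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(scores, min(sample_size, len(scores))))
    return [sample[(i * len(sample)) // num_bins] for i in range(1, num_bins)]


def assign_to_quantile_bins(scores, edges):
    """Map each score to the index of its bin; the mapping preserves score order."""
    return [bisect.bisect_right(edges, s) for s in scores]


def auc(keys, labels):
    """Trapezoidal area under the ROC curve; equal keys collapse into one point."""
    counts = {}
    for k, y in zip(keys, labels):
        pos, neg = counts.get(k, (0, 0))
        counts[k] = (pos + 1, neg) if y == 1 else (pos, neg + 1)
    total_pos = sum(p for p, _ in counts.values())
    total_neg = sum(n for _, n in counts.values())
    tp = 0
    area = 0.0
    # Walk thresholds from the highest key down, adding one trapezoid per key.
    for k in sorted(counts, reverse=True):
        pos, neg = counts[k]
        area += neg * (tp + pos / 2.0)
        tp += pos
    return area / (total_pos * total_neg)


# Synthetic, deliberately skewed scores: positives lean toward 1, negatives toward 0.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(20000)]
scores = [rng.random() ** (0.5 if y == 1 else 2.0) for y in labels]

exact = auc(scores, labels)
rounded = auc(round_to_bins(scores, 100), labels)
edges = quantile_edges(scores, 100)
binned = auc(assign_to_quantile_bins(scores, edges), labels)
print(exact, rounded, binned)
```

The point of the sketch is only that both variants cap the number of distinct keys (here at 100) before the by-key step, which is what avoids the memory blow-up; the quantile version keeps the bins roughly equally populated when the scores are skewed.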

On Sun, Nov 2, 2014 at 7:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Yes, if there are many distinct values, we need binning to compute the
> AUC curve. Usually, the scores are not evenly distributed, so we cannot
> simply truncate the digits. Estimating the quantiles for binning is
> necessary, similar to RangePartitioner:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
> Limiting the number of bins is definitely useful. Do you have time
> to work on it? -Xiangrui
>
> On Sun, Nov 2, 2014 at 9:34 AM, Sean Owen <so...@cloudera.com> wrote:
>> This might be a question for Xiangrui. Recently I was using
>> BinaryClassificationMetrics to build an AUC curve for a classifier
>> over a reasonably large number of points (~12M). The scores were all
>> probabilities, so tended to be almost entirely unique.
>>
>> The computation does some operations by key, and this ran out of
>> memory. It's something you can solve with more than the default amount
>> of memory, but in this case, it did not seem useful to create an AUC
>> curve with such fine-grained resolution.
>>
>> I ended up just binning the scores so there were ~1000 unique values
>> and then it was fine.
>>
>> Does that sound generally useful as some kind of parameter? Or am I
>> missing a trick here?
>>
>> Sean
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org