Agreed. Plain rounding only makes sense if the values are roughly evenly
distributed; in my case they were in [0,1]. I will put it on my to-do
list to look at, yes. Thanks for the confirmation.
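
For reference, a rough sketch of the quantile-based binning discussed in
this thread, in plain Python rather than against the Spark API. All
function names and the sample_size parameter are hypothetical, it assumes
both classes are present, and it is only meant to illustrate the idea:
estimate bin boundaries from a sample, key the counts by bin instead of
by raw score, then sweep the bins to get the curve.

```python
import bisect
import random


def quantile_boundaries(scores, num_bins, sample_size=10000, seed=0):
    """Estimate bin boundaries from a random sample of the scores."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(scores, min(sample_size, len(scores))))
    # Take num_bins - 1 interior cut points at evenly spaced sample ranks.
    cuts = [sample[(i * len(sample)) // num_bins] for i in range(1, num_bins)]
    # Duplicate cut points collapse, so there may be fewer bins than asked.
    return sorted(set(cuts))


def binned_counts(scored_labels, boundaries):
    """Count [positives, negatives] per bin: one key per bin, not per score."""
    counts = [[0, 0] for _ in range(len(boundaries) + 1)]
    for score, label in scored_labels:
        bin_index = bisect.bisect_right(boundaries, score)
        counts[bin_index][0 if label else 1] += 1
    return counts


def roc_points(counts):
    """Sweep bins from highest score to lowest, accumulating (FPR, TPR)."""
    total_pos = sum(c[0] for c in counts)
    total_neg = sum(c[1] for c in counts)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for pos, neg in reversed(counts):
        tp += pos
        fp += neg
        points.append((fp / total_neg, tp / total_pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With num_bins around 1000 the by-key step sees at most 1000 distinct
keys regardless of how many unique scores there are, and because the
boundaries follow the sample quantiles, skewed scores still spread
across the bins instead of piling into a few of them.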

On Sun, Nov 2, 2014 at 7:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Yes, if there are many distinct values, we need binning to compute the
> ROC curve and its AUC. Usually the scores are not evenly distributed,
> so we cannot simply truncate the digits. Estimating the quantiles for
> binning is necessary, similar to RangePartitioner:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
> Limiting the number of bins is definitely useful. Do you have time
> to work on it? -Xiangrui
>
> On Sun, Nov 2, 2014 at 9:34 AM, Sean Owen <so...@cloudera.com> wrote:
>> This might be a question for Xiangrui. Recently I was using
>> BinaryClassificationMetrics to build a ROC curve for a classifier
>> over a reasonably large number of points (~12M). The scores were all
>> probabilities, so they tended to be almost entirely unique.
>>
>> The computation does some operations by key, and this ran out of
>> memory. It's something you can solve with more than the default amount
>> of memory, but in this case it did not seem useful to create a curve
>> with such fine-grained resolution.
>>
>> I ended up just binning the scores so there were ~1000 unique values
>> and then it was fine.
>>
>> Does that sound generally useful as some kind of parameter? Or am I
>> missing a trick here?
>>
>> Sean
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
