Max Aller created MATH-1441:
-------------------------------
Summary: SimpleRegression#getSlopeConfidenceInterval recalculates
t distribution on every call
Key: MATH-1441
URL: https://issues.apache.org/jira/browse/MATH-1441
Project: Commons Math
Issue Type: Improvement
Affects Versions: 3.6.1
Environment: Java 8, Linux x64.
Reporter: Max Aller
SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the order of
100k or 1M times), is surprisingly slow - 3M calls, on my 3rd-gen i7 machine,
take *75 seconds*. This is primarily because recalculating the inverse
cumulative probability, and reconstructing the TDistribution object itself, is
somewhat expensive, entailing a lot of `log` and `sqrt` calls and iteration.
The confidence interval depends on only two values - *n* and *alpha*. I'd
posit that alpha will _usually_ be one of a very small set of values, and n,
at least in my case, is always the same - I'm calling this method over a
moving window of time-series data with a fixed number of points. But I
recognize n might be all over the place for some users.
In any event, I strongly believe some level of caching would greatly benefit
the users of Commons Math. I've forked my own version of
getSlopeConfidenceInterval that uses caching. Here's a test case demonstrating
that:
{code:java}
class SlowRegressionsTest {
    @Test
    void slow() {
        SimpleRegression simpleRegression = new SimpleRegression();
        simpleRegression.addData(0.0, 0.0);
        simpleRegression.addData(1.0, 1.0);
        simpleRegression.addData(2.0, 2.0);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 3_000_000; i++) {
            simpleRegression.getSlopeConfidenceInterval();
        }
        long duration = System.currentTimeMillis() - start;
        System.out.printf("`slow` took %dms%n", duration);
    }

    @Test
    void fast() {
        SimpleRegression simpleRegression = new SimpleRegression();
        simpleRegression.addData(0.0, 0.0);
        simpleRegression.addData(1.0, 1.0);
        simpleRegression.addData(2.0, 2.0);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 3_000_000; i++) {
            SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
        }
        long duration = System.currentTimeMillis() - start;
        System.out.printf("`fast` took %dms%n", duration);
    }
}{code}
which prints out
{noformat}
`fast` took 159ms
`slow` took 75304ms{noformat}
Nearly 500x faster - about 53ns per call. My particular solution is written in
Kotlin for Java 8, so it's not directly usable as a patch, but here it is:
{code:java}
package math

import org.apache.commons.math3.distribution.TDistribution
import org.apache.commons.math3.exception.OutOfRangeException
import org.apache.commons.math3.exception.util.LocalizedFormats
import org.apache.commons.math3.stat.regression.SimpleRegression
import java.util.concurrent.ConcurrentHashMap

@JvmOverloads
fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
    if (n < 3) {
        return Double.NaN
    }
    if (alpha >= 1 || alpha <= 0) {
        throw OutOfRangeException(
            LocalizedFormats.SIGNIFICANCE_LEVEL,
            alpha, 0, 1
        )
    }
    // No advertised NotStrictlyPositiveException here - the n < 3 check above returns NaN first
    // PATCH: use the cached inverse cumulative probability
    return slopeStdErr * getInverseCumulativeProbability(n, alpha)
}

private val cache = ConcurrentHashMap<Key, Double>()

private data class Key(val n: Long, val alpha: Double)

private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
    cache.getOrPut(Key(n, alpha)) {
        TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha / 2.0)
    }
{code}
Limitations: 1. it's Kotlin, and 2. the ConcurrentHashMap here is unbounded.
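If a plain-Java sketch helps, here's one way to address limitation 2 with a bounded LRU memo table keyed on (n, alpha). This is just an illustration, not Commons Math API - the class name, the 1024-entry bound, and the DoubleSupplier indirection (which in the real patch would wrap the `new TDistribution(n - 2).inverseCumulativeProbability(1 - alpha / 2)` call) are all my own choices:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.DoubleSupplier;

/**
 * Bounded memo table for Student's t quantiles keyed on (n, alpha).
 * Evicts the least-recently-used entry past the bound, so the cache
 * cannot grow without limit even if n varies widely across callers.
 */
final class TQuantileCache {
    private static final int MAX_ENTRIES = 1024; // hypothetical bound

    /** Cache key: sample size and significance level. */
    private static final class Key {
        final long n;
        final double alpha;

        Key(long n, double alpha) {
            this.n = n;
            this.alpha = alpha;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return n == k.n && Double.compare(alpha, k.alpha) == 0;
        }

        @Override
        public int hashCode() {
            return Objects.hash(n, alpha);
        }
    }

    // Access-order LinkedHashMap: removeEldestEntry gives LRU eviction.
    private final Map<Key, Double> cache =
            new LinkedHashMap<Key, Double>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Key, Double> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    /**
     * Returns the cached quantile for (n, alpha), invoking {@code compute}
     * only on a cache miss. Synchronized because LinkedHashMap in
     * access-order mode mutates on reads too.
     */
    synchronized double get(long n, double alpha, DoubleSupplier compute) {
        Key key = new Key(n, alpha);
        Double cached = cache.get(key);
        if (cached == null) {
            cached = compute.getAsDouble();
            cache.put(key, cached);
        }
        return cached;
    }
}
```

A simple synchronized map is likely fine here since the cached computation is the expensive part; a ConcurrentHashMap with a size check, or Guava's CacheBuilder, would be alternatives if lock contention ever mattered.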
I don't know how/if Commons Math does caching elsewhere, but it'd sure be handy
here, I believe. What are your thoughts?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)