On 11/8/13 1:27 PM, Phil Steitz wrote:
> On 11/8/13 4:35 AM, Matt Adereth wrote:
>> While writing the test cases for KendallsCorrelation, I discovered an
>> interesting behavior with SpearmansCorrelation that might be considered an
>> inconsistency. SpearmansCorrelation.correlate() throws
>> MathIllegalArgumentException if the array length is less than 2, but
>> returns Double.NaN if the array contains multiple copies of a single value.
> The latter sounds like a bug, assuming you are using the default
> NaturalRanking rank transform. Ties should be averaged and handled
> correctly in this case. Please open a JIRA, ideally with test case
> for this.

Does not actually look like a bug; at least, I have not been able to
reproduce it. You do get NaN when there are not at least two distinct
values in the x array (the first array to be correlated). That does
need to be documented (as it is in SimpleRegression).

Phil

>
>> This seems inconsistent with how insufficient data is handled elsewhere in
>> Apache Commons Math.
> Good point. I think there is justification for the different
> behavior here though. SimpleRegression and the univariate stats are
> mutable, maintaining a dataset that can be added to, with stats
> queried at any point. So while in theory, getSlope() in
> SimpleRegression could throw IllegalStateException (IAE not really
> appropriate here) when there is not enough data in the model, its
> documented behavior in this case is to return NaN. The key is to
> clearly document the behavior. SimpleRegression does this well, the
> correlation classes not so much. Patches welcome to improve the
> documentation of preconditions and behavior of these classes. I
> would be OK with changing the correlation classes to return NaNs in
> place of throwing IAE on insufficient data; but this change should
> happen in a major release (i.e. wait for 4.0).
>
> Phil
>
>
>> In the User Guide for SimpleRegression it says:
>>
>>> When there are fewer than two observations in the model, or when there is
>>> no variation in the x values (i.e. all x values are the same) all
>>> statistics return NaN. At least two observations with different x
>>> coordinates are required to estimate a bivariate regression model.
>>
>> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
>> enough data.
>>
>> When I'm computing various statistics on multiple datasets, it seems
>> unnecessarily cumbersome to specially handle an exception for one statistic
>> and NaNs for the others.
>> I propose that PearsonsCorrelation and
>> SpearmansCorrelation should return NaN if there is insufficient data,
>> whether it be from not enough observations (< 2) or not enough unique
>> values.
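
For anyone following the thread, here is a rough, stand-alone sketch of
where the NaN comes from when one array has no variation. It is not the
Commons Math code: the class CorrelationNaNSketch and its methods are
hypothetical names made up for illustration, and the tie handling only
approximates what an averaging rank transform would do. With a constant
x array every rank is identical, so the sum of squared deviations is
zero and the final division is 0/0.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the Commons Math implementation. */
class CorrelationNaNSketch {

    /** 1-based ranks; tied values share the average of the positions they span. */
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort indices by the value they point at.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            // Extend j to cover the whole run of tied values.
            while (j + 1 < n && values[order[j + 1]] == values[order[i]]) {
                j++;
            }
            double average = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = average;
            }
            i = j + 1;
        }
        return ranks;
    }

    /** Plain Pearson correlation; yields NaN when either input has zero variance. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // With no variation in x, sxx and sxy are both 0, so this is 0/0 = NaN.
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Spearman correlation as Pearson correlation of the averaged ranks. */
    static double spearman(double[] x, double[] y) {
        return pearson(averageRanks(x), averageRanks(y));
    }

    public static void main(String[] args) {
        double[] constantX = {3.0, 3.0, 3.0, 3.0};
        double[] y = {1.0, 2.0, 3.0, 4.0};
        System.out.println(spearman(constantX, y)); // NaN
        System.out.println(spearman(y, new double[] {10.0, 20.0, 30.0, 40.0})); // 1.0
    }
}
```

Note that this sketch also returns NaN for fewer than two observations,
simply because it never validates the input length. That matches the
behavior proposed above, not the exception described at the start of the
thread.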