On 11/8/13 4:35 AM, Matt Adereth wrote:
> While writing the test cases for KendallsCorrelation, I discovered an
> interesting behavior with SpearmansCorrelation that might be considered an
> inconsistency. SpearmansCorrelation.correlate() throws
> MathIllegalArgumentException if the array length is less than 2, but
> returns Double.NaN if the array contains multiple copies of a single value.

The latter sounds like a bug, assuming you are using the default
NaturalRanking rank transform. Ties should be averaged and handled
correctly in this case. Please open a JIRA, ideally with a test case
for this.

>
> This seems inconsistent with how insufficient data is handled elsewhere in
> Apache Commons Math.

Good point. I think there is justification for the different behavior
here, though. SimpleRegression and the univariate stats are mutable,
maintaining a dataset that can be added to, with stats queried at any
point. So while in theory getSlope() in SimpleRegression could throw
IllegalStateException (IAE is not really appropriate here) when there is
not enough data in the model, its documented behavior in this case is to
return NaN. The key is to clearly document the behavior.
SimpleRegression does this well; the correlation classes, not so much.
Patches are welcome to improve the documentation of preconditions and
behavior of these classes.

I would be OK with changing the correlation classes to return NaN in
place of throwing IAE on insufficient data, but this change should
happen in a major release (i.e. wait for 4.0).

Phil

>
> In the User Guide for SimpleRegression it says:
>
>> When there are fewer than two observations in the model, or when there is
>> no variation in the x values (i.e. all x values are the same) all
>> statistics return NaN. At least two observations with different x
>> coordinates are required to estimate a bivariate regression model.
>
> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
> enough data.
>
> When I'm computing various statistics on multiple datasets, it seems
> unnecessarily cumbersome to specially handle an exception for one statistic
> and NaNs for the others. I propose that PearsonsCorrelation and
> SpearmansCorrelation should return NaN if there is insufficient data,
> whether it be from not enough observations (< 2) or not enough unique
> values.
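
For anyone who wants uniform NaN handling before any such change lands, here is a minimal sketch of a caller-side wrapper. It assumes only that the exception thrown on insufficient data is an IllegalArgumentException or a subclass of it. The class NanOnInsufficientData and both of its methods are hypothetical names made up for illustration and are not part of Commons Math; strictPearson is a hand-written stand-in that mimics the "throw on fewer than 2 observations" behavior so the example runs without the library.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper for illustration only; not part of Apache Commons Math.
final class NanOnInsufficientData {

    private NanOnInsufficientData() {
    }

    /**
     * Evaluates the given bivariate statistic and maps an
     * IllegalArgumentException (or any subclass) to Double.NaN.
     */
    static double correlate(ToDoubleBiFunction<double[], double[]> statistic,
                            double[] x, double[] y) {
        try {
            return statistic.applyAsDouble(x, y);
        } catch (IllegalArgumentException e) {
            // Treat "not enough data" like the NaN-returning statistics do.
            return Double.NaN;
        }
    }

    /**
     * Stand-in Pearson correlation (assumption: written by hand here).
     * Throws when lengths differ or fewer than 2 observations are given.
     */
    static double strictPearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException(
                "need two arrays of equal length, at least 2 elements each");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // With no variation in x or y this is 0.0 / 0.0, which is NaN.
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

A call such as NanOnInsufficientData.correlate(NanOnInsufficientData::strictPearson, x, y) then yields NaN both for too few observations and for constant input. With Commons Math on the classpath, a method reference to the real correlation method could presumably be passed in place of strictPearson. One caveat of this design: the catch block also hides other argument errors, such as mismatched array lengths, so it trades diagnosability for uniformity.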