Hello,
It would be best to move this discussion over to gsl-discuss. I think
it would be very useful to have this function in GSL. Just a few
comments on your code:
1) The code looks clean and nicely commented. One issue is that since
you appear to have followed the Apache code very closely, there may be a
licensing issue - I don't know if the Apache license is compatible with
the GPL. On a quick check, it's possible we can use it, but it seems we
need to preserve the original copyright notice.
2) Dynamic allocation - it looks like you dynamically allocate 5
different arrays to do the calculation. It would be better to either
make functions like gsl_stats_spearman_alloc and
gsl_stats_spearman_free, or to pass in a pre-allocated workspace as one
of the function arguments. Since you're using workspace of different
types (double, size_t), it's probably better to make the alloc/free functions.
3) One of your dynamically allocated arrays is realloc()'d in a loop. Is
this because the size of the array is unknown before the loop? Perhaps
there is a way to avoid the realloc calls.
4) We also need to think of some automated tests that can be added to
statistics/test.c to test this function exhaustively and make sure it's
working correctly - even if that consists simply of known output values
for a few different input cases.
Good work,
Patrick Alken
On 02/09/2012 04:26 PM, Timothée Flutre wrote:
Hello,
I noticed that only the Pearson correlation coefficient is implemented
in the GSL (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
However, in quantitative genetics, several authors use the
Spearman coefficient (for instance, Stranger et al., "Population genomics
of human gene expression", Nature Genetics, 2007) as it is less
influenced by outliers.
Current high-throughput data requires computing this coefficient several
million times. Thus I implemented the computation of the Spearman
coefficient in GSL-like code. In fact, one just needs to rank the input
vectors and then compute the Pearson coefficient on the ranks. For the
ranking, I drew on the code from the Apache Math module.
I was thinking that it could be useful to other users to add my piece
of code to the file "covariance_source.c" of the GSL
(http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
So here is the code: https://gist.github.com/1784199
I am not very proficient in C, so even if it is not possible to
include the code in the GSL, don't hesitate to give me advice.
Thanks,
Tim