Thanks for your input!

1) Here is the text of the license under which the Apache code is released:
http://www.apache.org/licenses/LICENSE-2.0. Indeed, it seems that we would
have to indicate their copyright. Is this a problem? In a way, there are not
many different algorithms to compute the Spearman coefficient...

2) I have made the changes and now have "gsl_stats_spearman_alloc" and
"gsl_stats_spearman_free" functions for the four arrays ranks1, ranks2, d
and p. I added the code as a second file to the same gist:
https://gist.github.com/1784199#file_spearman_v2.c

3) Yes, we don't know in advance how many ties there will be. That's why I
reallocate inside the loop. I don't see how to do it differently.

4) I added a function performing tests, using the data defined in
statistics/test_float_source.c.

What do I do now? Do I need write access to the GSL repository on Savannah?
Or maybe someone else can commit it for me?

Thanks,
Tim

On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken <[email protected]> wrote:
>
> Hello,
>
> It would be best to move this discussion over to gsl-discuss. I think it
> would be very useful to have this function in GSL. Just a few comments on
> your code:
>
> 1) The code looks clean and nicely commented. One issue is that since you
> appear to have followed the Apache code very closely, there may be a
> licensing issue - I don't know if the Apache license is compatible with the
> GPL. On a quick check, it's possible we can use it but it seems we need to
> preserve the original copyright notice.
>
> 2) Dynamic allocation - it looks like you dynamically allocate 5 different
> arrays to do the calculation. It would be better to either make functions
> like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to pass in a
> pre-allocated workspace as one of the function arguments. Since you're using
> workspace of different types (double, size_t), it's probably better to make
> the alloc/free functions.
>
> 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is this
> because the size of the array is unknown before the loop? Perhaps there is a
> way to avoid the realloc's.
>
> 4) We also need to think of some automated tests that can be added to
> statistics/test.c to test this function exhaustively and make sure it's
> working correctly - even if that consists simply of known output values for a
> few different input cases.
>
> Good work,
> Patrick Alken
>
>
> On 02/09/2012 04:26 PM, Timothée Flutre wrote:
>>
>> Hello,
>>
>> I noticed that only the Pearson correlation coefficient is implemented
>> in the GSL
>> (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
>> However, in quantitative genetics, several authors are using the
>> Spearman coef (for instance, Stranger et al "Population genomics of
>> human gene expression", Nature Genetics, 2007) as it is less
>> influenced by outliers.
>>
>> Current high-throughput data requires computing such a coef several
>> million times. Thus I implemented the computation of the Spearman
>> coef in GSL-like code. In fact, one just needs to rank the input
>> vectors and then compute the Pearson coef on them. For the ranking, I
>> was inspired by the code from the Apache Math module.
>>
>> I was thinking that it could be useful to other users to add my piece
>> of code to the file "covariance_source.c" of the GSL
>> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
>> So here is the code: https://gist.github.com/1784199
>>
>> I am not very proficient in C, so even if it is not possible to
>> include the code in the GSL, don't hesitate to give me advice.
>>
>> Thanks,
>> Tim
>>
