Alex, John may follow your comments better than I do, but I think I would need you to be more specific about how these ideas would be implemented in PSPP to solve a specific problem (i.e., computing robust significance values for the correlations in CROSSTABS).
-Alan

On 12/26/2020 10:08 PM, Alex Ernesto Davila Davila wrote:
> Dear all,
>
> Concerning the discussion on the validity of the Spearman test for
> correlations, some ideas follow:
>
> a) Symmetry of a distribution is the property I would look at before
> choosing the mean as a central measure and the standard
> deviation as a dispersion measure. I would say that the normality
> assumption is not necessary; symmetry suffices to use means, standard
> deviations, and Pearson correlations. As we know, the mean and variance
> may be estimators for several distributions.
>
> b) The significance of a test depends on the theoretical
> distribution: again, neither a normal nor a Student's t theoretical
> distribution would be needed as an a priori distribution. It would
> suffice to characterize the theoretical distribution that best fits
> the sample distribution: normal, uniform, or whatever
> comes out and has means and variances as estimators. Basically, the
> theoretical distribution would be the limit case going from a rational
> to a real space; to make things simpler, a 1-dimensional space.
>
> c) If symmetry is lacking, then an alternative story may stand based
> on medians and dispersion measures such as P75-P25.
>
> d) The use of ranks is the easiest way to deal with asymmetry, but
> information such as that in c) is somewhat obscured.
>
> e) What seems too far-fetched to me is the definition of the
> Spearman coefficient and its derivation from Pearson.
>
> Best,
>
> Alex
> Pontifical Catholic University of Peru
>
> On Sat, 26 Dec 2020 at 20:58, Alan Mead (<ame...@alanmead.org>) wrote:
>
> > On 12/26/2020 2:27 PM, John Darrington wrote:
> > > There is a brief discussion of the issue here:
> > > https://www.spss-tutorials.com/spearman-rank-correlation/
> > >
> > > but again, to be sure, I'd want to review some of the academic literature
> > > first.
> >
> > John,
> >
> > I see your point. This tutorial says that for N >= 30, you should
> > use the standard t-test (that's my read). The formula given is:
> >
> > t = rs * sqrt( N - 2 ) / sqrt( 1 - rs**2 );
> > df = N - 2;
> >
> > You then compare this to the t-distribution.
> >
> > When N < 30, he references a permutation test. This test constructs
> > an empirical H0 distribution (similar to something like
> > bootstrapping) based on the assumption that if H0 is true, you can
> > randomly permute the two samples without damaging the correlation.
> > So, one version of this test takes the dataset <X,Y> and
> > constructs a new dataset <S1,S2> where each element X[i] is
> > randomly assigned to S1 or S2 (and Y[i] is assigned to the other),
> > and Rs is calculated. This is then repeated until you have a
> > sufficient empirical H0 distribution.
> >
> > This can be done exactly (i.e., each possible permutation can be
> > enumerated) for small N. I'm having trouble visualizing how many
> > values this is... You're making a binary choice for each element,
> > so if you have N=10, that's 2**10 = 1024 possible choices of S1
> > and S2? But one post suggested that it's 10! = 3.6E6, which is
> > getting big. For sample sizes like 10 < N < 30 you would just
> > choose a large random set of permuted datasets (like bootstrapping).
> >
> > I guess R spearman_test implements this test and that the test
> > fails if there are ties. I guess we could examine the R code to
> > see how this works?
> >
> > This paper, https://arxiv.org/pdf/2008.01200.pdf, suggests that the
> > test is flawed both in small samples and in samples with distinctly
> > non-normal underlying data. I don't know what it means to be
> > "normally distributed" for ranks... Ranks are always distributed
> > uniformly unless there are ties. Their method is implemented in
> > the 'perk' library and is also a sampling/resampling approach.
> >
> > IIRC, the inquiry that started this discussion was about a sample
> > of N = 100.
> > I think PSPP should just report the standard t-test
> > results for all cases. This replicates SPSS bug-for-bug.
> >
> > Alternatively, I wouldn't be upset if PSPP refused to print any
> > p-value for N < 30. I think ideally we would add a keyword
> > requesting a more advanced algorithm.
> >
> > Finally, I don't think any of this discussion bears on why the
> > p-value is missing from the Pearson r in CROSSTABS.
> >
> > -Alan
> >
> > --
> >
> > Alan D. Mead, Ph.D.
> > President, Talent Algorithms Inc.
> >
> > science + technology = better workers
> >
> > http://www.alanmead.org
> >
> > The irony of this ... is that the Internet is
> > both almost-infinitely expandable, while at the
> > same time constrained within its own pre-defined
> > box. And if that makes no sense to you, just
> > reflect on the existence of Facebook. We have
> > the vastness of the internet and yet billions
> > of people decided to spend most of their time
> > within a horribly designed, fake-news emporium
> > of a website that sucks every possible piece of
> > personal information out of you so it can sell it
> > to others. And they see nothing wrong with that.
> >
> > -- Kieren McCarthy, commenting on why we are not
> > all using IPv6
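
To make the two approaches from the thread concrete, here is a minimal Python sketch, not PSPP's, SPSS's, or R's actual code. It computes rs with average ranks for ties, applies the t approximation quoted from the tutorial, and runs a permutation test. All function names are hypothetical. The permutation test here assumes the common formulation in which the Y ranks are shuffled against fixed X ranks, which gives N! arrangements (the 10! = 3.6E6 figure mentioned above) rather than 2**N; whether the tutorial means exactly this version is an assumption.

```python
import itertools
import math
import random


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rs(x, y):
    """Spearman's rs as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def t_approximation(rs, n):
    """t = rs * sqrt(N - 2) / sqrt(1 - rs**2), df = N - 2.

    Undefined when |rs| == 1 (division by zero).
    """
    return rs * math.sqrt(n - 2) / math.sqrt(1 - rs ** 2), n - 2


def permutation_p_value(x, y, n_random=None, seed=0):
    """Two-sided permutation p-value for rs.

    n_random=None enumerates all N! orderings of the Y ranks (small N only);
    otherwise n_random random shuffles are drawn (Monte Carlo).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    if n_random is None:
        perms = itertools.permutations(ry)
    else:
        rng = random.Random(seed)
        perms = (rng.sample(ry, len(ry)) for _ in range(n_random))
    hits = total = 0
    for perm in perms:
        total += 1
        # Small tolerance so permutations equal to the observed value count.
        if abs(pearson(rx, perm)) >= observed - 1e-12:
            hits += 1
    return hits / total
```

Calling `permutation_p_value(x, y)` enumerates every ordering, so it is only practical for roughly N <= 10; for larger samples pass something like `n_random=10000`. The t statistic from `t_approximation` would still need to be compared against a t distribution with N - 2 degrees of freedom, which this sketch leaves to a statistics library.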