Hello,

Inline

On 21-02-2014 23:13, Rolf Turner wrote:
On 22/02/14 11:04, Rui Barradas wrote:
Hello,

This does not answer your question directly, but if the sample size is a
documented limitation of shapiro.test and you want a normality test, why
don't you use ks.test (see ?ks.test)?

m <- mean(HP_TrinityK25$V2)   # mean estimated from the data
s <- sd(HP_TrinityK25$V2)     # standard deviation estimated from the data

ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Strictly speaking this is not a valid test.  The KS test is used for
testing against a *completely specified* distribution.  If there are
parameters to be estimated, the null distribution is no longer
applicable.  This may not be a "real" problem if the parameters are
*well* estimated, as they would be in this instance (given that the
sample size is over-large).  I'm not sure about this.

Yes, you're right. I hesitated before posting my answer precisely because of this: the parameters must be pre-determined constants, not computed from the data. As Greg pointed out in his reply, the help page for ?ks.test also refers to this explicitly (which I had missed).

The chi-squared gof test seems to be a good choice, given the sample size.

Rui Barradas

The "Lilliefors" test is theoretically available in this context when
mu and sigma are estimated, but according to the Wikipedia article, the
Lilliefors distribution is not known analytically and the critical
values must be determined by Monte Carlo methods.  There is a
"LillieTest" function in the "DescTools" package which makes use of some
approximations to get p-values.
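
A minimal usage sketch, assuming the "DescTools" package is installed (the
data column is the original poster's):

library(DescTools)              # provides LillieTest()
LillieTest(HP_TrinityK25$V2)    # Lilliefors test of normality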

However I think that a better approach would be to use a chi-squared
goodness of fit test whereby you can adjust for estimated parameters
simply by reducing the degrees of freedom.  I believe that the
chi-squared test is somewhat low in power, but with a very large sample
this should not be a problem.

The difficulty with the chi-squared test is that the choice of "bins" is
somewhat arbitrary.  I believe the best approach is to take the bin
boundaries to be the quantiles of the normal distribution (with
parameters "m" and "s") corresponding to equispaced probabilities on
[0,1], the number of such probabilities being k+1 where
k = floor(n/5), n being the sample size.  This makes the expected counts
all equal to n/k >= 5, so that the chi-squared test is "valid".  The
degrees of freedom are then k - 3 (i.e. k - 1 minus the number of
estimated parameters).
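
For concreteness, here is a minimal sketch of that recipe in base R; it
assumes "x" is the data vector (e.g. HP_TrinityK25$V2) and that the mean
and standard deviation are estimated from the data as above:

x <- HP_TrinityK25$V2
n <- length(x)
m <- mean(x)                                   # estimated mean
s <- sd(x)                                     # estimated standard deviation
k <- floor(n/5)                                # number of equiprobable bins
brks <- qnorm(seq(0, 1, length.out = k + 1), mean = m, sd = s)
obs  <- table(cut(x, breaks = brks))           # observed counts per bin
expd <- rep(n/k, k)                            # expected counts, all n/k >= 5
X2   <- sum((obs - expd)^2/expd)               # chi-squared statistic
pchisq(X2, df = k - 3, lower.tail = FALSE)     # p-value on k - 1 - 2 d.f.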

One last comment:  I believe that it is generally considered that
testing for normality is a waste of time and a pseudo-intellectual
exercise of academic interest at best.

cheers,

Rolf Turner



Hope this helps,

Rui Barradas

On 21-02-2014 15:59, Gonzalo Villarino Pizarro wrote:
Dear R users,
Please help with this (maybe basic) question. I am trying to see whether
my data are normal, but the file is large and the test does not work.
I keep getting the message: "Error in shapiro.test(x = HP_TrinityK25$V2)
: sample size must be between 3 and 5000"
thanks!

  shapiro.test(x = HP_TrinityK25$V2)
Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be between 3 and 5000

## Note:
## HP_TrinityK25    = my file
## HP_TrinityK25$V2 = data (column) in my file


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

