This is quite a nasty one! I've been having a look at it, and I think it would be useful to go back to first principles.
Given a series u[1:N], the correlation coefficient between u[1:(N-k)] and u[(k+1):N] is

  c(k) = cov(u[1:(N-k)], u[(k+1):N]) / sqrt(var(u[1:(N-k)]) * var(u[(k+1):N]))

in which, in particular, the means of u[1:(N-k)] and u[(k+1):N] are subtracted from their respective series in calculating the variances and the covariance. This definition can be adopted for the serial correlation of lag k.

However, variants of this definition are often adopted for the sake of "better estimation" (in various senses). These include subtracting the mean M of the entire series from both u[1:(N-k)] and u[(k+1):N], then forming

  sum( (u[1:(N-k)] - M) * (u[(k+1):N] - M) )

in calculating the covariance, and using var(u[1:N]) in the denominator instead of the above denominator.

But only the above definition of c(k) is guaranteed to yield a result between -1 and 1; variants can theoretically give a result outside this range. This is unlikely to happen for a long series -- indeed, there is unlikely to be much difference between the various ways of doing it in a long series. But in your case you do not have long series.

I have read your series into a vector called x, of length N=6940. For example, the set of values i such that neither x[i] nor x[i+1] is NA is:

  which(!is.na(x[1:(N-1)]) & !is.na(x[2:N]))
   [1] 3238 4604 4605 4606 4932 4933 4985 4991 4992 4993 4994
  [12] 5331 5353 5429 5478 5479 5527 5554 5555 5569 5583 5597
  [23] 5625 5639 5653 5681 5682 5683 5905 5947 5977 5978 6043
  [34] 6044 6045 6241 6283 6297 6301 6311 6395 6396 6397 6398

only 44 in length; and the lengths of runs of successive integers in the above are:

  1 3 2 1 4 1 1 1 2 1 2 1 1 1 1 1 1 3 1 1 2 3 1 1 1 1 1 4

Now let

  ix.1 <- which(!is.na(x[1:(N-1)]) & !is.na(x[2:N]))
  x1.0 <- x[ix.1]
  x1.1 <- x[ix.1 + 1]

Now we can calculate a correlation of lag 1 as

  c(1) = cov(x1.0, x1.1)/sqrt(var(x1.0)*var(x1.1)) = 0.8511088

which is nicely between -1 and 1. This is calculated using the basic definition of c(k) above, not any "variant".
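As a self-contained illustration of the approach (on simulated data, not on Steve's series), the complete-pairs calculation above is just the ordinary sample correlation of the paired values, so it cannot leave [-1, 1] no matter how much of the series is missing:

```r
## Toy sketch: lag-1 correlation from complete pairs only, using the
## basic definition of c(k) above. The AR(1) series and NA pattern are
## made up for illustration.
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.8), n = 200))
x[sample(200, 120)] <- NA          # knock out most of the series

N    <- length(x)
ix.1 <- which(!is.na(x[1:(N-1)]) & !is.na(x[2:N]))   # i with x[i], x[i+1] both present
x1.0 <- x[ix.1]
x1.1 <- x[ix.1 + 1]

c1 <- cov(x1.0, x1.1) / sqrt(var(x1.0) * var(x1.1))
c1            # same value as cor(x1.0, x1.1), hence guaranteed in [-1, 1]
```

Note that cov(a,b)/sqrt(var(a)*var(b)) is exactly cor(a,b), which is where the [-1, 1] guarantee comes from.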
NOW: What variant would we like to play with?

[1]: Subtract the mean of all the x's occurring in x1.0 and x1.1:

  ix <- union(ix.1, ix.1 + 1); n <- length(ix)
  M <- mean(x[ix])
  v1.0 <- x1.0 - M; v1.1 <- x1.1 - M
  sum(v1.0*v1.1)/sqrt(sum(v1.0^2)*sum(v1.1^2))
  = 0.8523563

which is not very different.

[2]: Subtract the mean of all non-NA x's in x:

  M <- mean(x, na.rm=TRUE)
  v1.0 <- x1.0 - M; v1.1 <- x1.1 - M
  sum(v1.0*v1.1)/sqrt(sum(v1.0^2)*sum(v1.1^2))
  = 0.8621987

which is a bit more different.

[3]: Use the covariance of x1.0 and x1.1 as in c(k), but use the variance of all the non-NA x's:

  V <- var(x, na.rm=TRUE)
  cov(x1.0, x1.1)/V
  = 1.196581

This is beginning to look like what R's acf() does! But not quite: acf() at lag 1 gave 1.258, at lag 2 1.192. However, it makes the point ... I too would be unhappy with results, purporting to be correlations, which so much exceeded 1.

I would normally think of using the above method for c(k), along the lines of

  Ck <- function(x, k){
    N <- length(x)
    ix.1 <- which(!is.na(x[1:(N-k)]) & !is.na(x[(k+1):N]))
    x1.0 <- x[ix.1]
    x1.1 <- x[ix.1 + k]
    cov(x1.0, x1.1)/sqrt(var(x1.0)*var(x1.1))
  }
  acfs <- numeric(10)
  for(k in 1:10){ acfs[k] <- Ck(x, k) }
  cbind(1:10, acfs)

  #          acfs
  #  1 0.8511088
  #  2 0.5881168
  #  3 0.6543933
  #  4 0.9046573
  #  5 0.9392939
  #  6 0.8076299
  #  7 0.8605418
  #  8 0.9431999
  #  9 0.9667194
  # 10 0.9477807

(Note): I've also had a look at the code in acf(), but it's not clear to me exactly what it does in a case like yours!

Hoping this is of some help,
Ted.

On 14-Sep-09 13:53:01, Steve Jones wrote:
> I misunderstood what the help page was saying - I thought that the
> missing values were what constituted the invalid function. Clearly I
> was mistaken.
>
> Looks like I'll have to compute the acf myself. Shame, but such is
> life.
>
> Thanks for the input, everyone!
> Steve.
>
> Berwin A Turlach wrote:
>> G'day Steve,
>>
>> On Mon, 14 Sep 2009 13:44:56 +0200
>> Steve Jones <st...@squaregoldfish.co.uk> wrote:
>>
>>> Apologies for the missing data.
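One way to see that the discrepancy is a missing-data effect rather than a flaw in the complete-pairs estimator: on a long series with no NAs, Ck() and acf() agree closely (acf() normalises by the overall variance and divides by n rather than n-k, so small differences remain). A sketch on simulated data, with the Ck() definition repeated here so the snippet is self-contained:

```r
## Sanity check on simulated, complete data (not Steve's series):
## complete-pairs lag-k correlation, as in the Ck() function above.
Ck <- function(x, k){
  N    <- length(x)
  ix.1 <- which(!is.na(x[1:(N-k)]) & !is.na(x[(k+1):N]))
  cov(x[ix.1], x[ix.1 + k]) / sqrt(var(x[ix.1]) * var(x[ix.1 + k]))
}

set.seed(42)
y  <- as.numeric(arima.sim(list(ar = 0.7), n = 2000))   # no NAs
a1 <- acf(y, lag.max = 1, plot = FALSE)$acf[2]          # acf()'s lag-1 value

a1 - Ck(y, 1)   # small: the two estimators nearly coincide here
```

The large values you saw from acf() with na.pass arise only because, with so few complete pairs, the pairwise covariance and the overall variance are estimated from almost disjoint sets of observations.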
It can be downloaded from here
>>> (22Kb): http://www.squaregoldfish.co.uk/sekrett/series.csv
>>
>> Well, the Details section of acf's help page states:
>>
>>   By default, no missing values are allowed. If the 'na.action'
>>   function passes through missing values (as 'na.pass' does), the
>>   covariances are computed from the complete cases. This means
>>   that the estimate computed may well not be a valid autocorrelation
>>   sequence, and may contain missing values. [...]
>>
>> And you seem to have a massive amount of missing data:
>>
>> R> dat <- scan(url("http://www.squaregoldfish.co.uk/sekrett/series.csv"))
>> Read 6940 items
>> R> mean(!is.na(dat))
>> [1] 0.02881844
>>
>> And, not surprisingly, an even smaller proportion of consecutive,
>> non-missing observations.
>>
>> R> mean(!is.na(dat[-1]) & !is.na(dat[-length(dat)]))
>> [1] 0.006340971
>>
>> You can find out which formulae are used exactly by acf by studying
>> the code, but this might give you an idea about what is going on:
>>
>> R> ind <- !is.na(dat)
>> R> (mu <- mean(dat[ind]))    ## too lazy for mean(dat, na.rm=TRUE)
>> [1] 373.5165
>> R> (sig2 <- var(dat[ind]))   ## ditto
>> [1] 463.4041
>> R> ind <- which(!is.na(dat[-1]) & !is.na(dat[-length(dat)]))
>> R> sum( (dat[ind]-mu) * (dat[ind+1]-mu) ) / length(ind)
>> [1] 593.3041
>> R> sum( (dat[ind]-mu) * (dat[ind+1]-mu) ) / length(ind) / sig2
>> [1] 1.280317
>>
>> HTH
>> Cheers,
>>
>>         Berwin

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Sep-09                                       Time: 18:20:53
------------------------------ XFMail ------------------------------

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.