Dear All,
I am studying some process measurement time series in R and trying to identify
time delays using cross-correlation function ccf. The results have however been
bit confusing. I found a couple of years old message about this issue but
unfortunately wasn't able to find it again for a reference.
For example, an obvious time shift is observed between the measurements y1 and
y2 when the following test data is plotted:
x <- 1:121
y1 <- c(328.27, 328.27, 328.27, 328.27, 328.21, 328.14, 328.14, 328.01,
328.07, 328.01, 327.87, 328.01, 328.07, 328.27, 328.27, 328.54, 328.61,
328.74, 328.88, 329.01, 329.01, 329.21, 329.28, 329.35, 329.35, 329.42,
329.35, 329.28, 329.28, 329.15, 329.08, 329.08, 328.95, 328.95, 328.95,
328.95, 329.01, 329.15, 329.21, 329.28, 329.55, 329.62, 329.75, 329.82,
329.89, 330.09, 330.09, 330.29, 330.29, 330.36, 330.42, 330.29, 330.15,
330.22, 330.09, 329.95, 329.82, 329.75, 329.62, 329.55, 329.48, 329.55,
329.68, 329.75, 329.82, 329.89, 330.09, 330.09, 330.15, 330.22, 330.42,
330.42, 330.42, 330.36, 330.42, 330.22, 330.15, 330.09, 329.89, 329.75,
329.55, 329.35, 329.35, 329.42, 329.48, 329.55, 329.75, 329.75, 329.82,
330.09, 330.15, 330.42, 330.42, 330.62, 330.69, 330.69, 330.83, 330.83,
330.76, 330.62, 330.62, 330.56, 330.42, 330.42, 330.29, 330.29, 330.29,
330.29, 330.22, 330.49, 330.56, 330.62, 330.76, 331.03, 330.96, 331.16,
331.23, 331.50, 331.63, 332.03, 332.03)
y2 <- c(329.68, 329.75, 329.82, 329.95, 330.02, 330.15, 330.22, 330.36,
330.22, 330.29, 330.29, 330.29, 330.29, 330.15, 330.22, 330.22, 330.15,
330.15, 330.15, 330.15, 330.15, 330.29, 330.49, 330.49, 330.62, 330.89,
331.03, 331.09, 331.16, 331.30, 331.30, 331.36, 331.43, 331.43, 331.43,
331.36, 331.36, 331.36, 331.36, 331.23, 331.23, 331.16, 331.16, 331.23,
331.30, 331.23, 331.36, 331.56, 331.70, 331.83, 331.97, 331.97, 332.10,
332.30, 332.44, 332.44, 332.51, 332.51, 332.57, 332.57, 332.51, 332.37,
332.24, 332.24, 332.10, 331.97, 331.97, 331.90, 331.83, 331.97, 331.97,
331.97, 332.03, 332.24, 332.30, 332.30, 332.37, 332.57, 332.57, 332.57,
332.57, 332.57, 332.51, 332.37, 332.30, 332.17, 331.97, 331.83, 331.70,
331.70, 331.63, 331.63, 331.70, 331.83, 331.90, 332.10, 332.24, 332.30,
332.44, 332.57, 332.71, 332.84, 332.77, 332.91, 332.84, 332.84, 332.91,
332.84, 332.77, 332.77, 332.64, 332.64, 332.57, 332.57, 332.57, 332.57,
332.57, 332.71, 332.98, 333.24, 333.58)
matplot(cbind(y1, y2))
However, the cross-correlation function surprisingly gives the maximum
correlation 0.83 at zero lag:
ccf(y1, y2)
Plotting of variables against each other with zero lag
plot(y1, y2)
shows that the correlation is not that good. Instead, a very nice correlation
is obtained with a plot with shifted variables:
plot(y1[1:113], y2[1:113 + 8])
As a comparison I defined my own cross correlation function:
cross.cor <- function(x, y, k)
{
n <- length(x) - abs(k)
if (k >= 0)
cor(x[1:n + k], y[1:n])
else
cor(x[1:n], y[1:n - k])
}
The new function cross.cor gives maximum correlation of 0.99 at lag -8, and the
shape of the correlation function is very different from the one obtained with
ccf (both methods give same value at zero lag):
plot(-15:15, sapply(-15:15, function(lag) cross.cor(y1, y2, lag)),
ylim = c(0.3, 1))
points(-15:15, ccf(y1, y2, 15, plot = FALSE)$acf, col = "red")
In order to understand the reason for the differences, I looked at the program
codes of ccf function. When the variables are compared with a nonzero lag, some
the data points must be left out from the tails of the time series. It appears
to me that ccf calculates (partly in R code, partly in C code) first the means
and the standard deviations using the whole data, and then uses those values as
constants in further calculations. The time series are truncated only in the
summations for covariances.
That approach naturally speeds up the computations, but is it really correct?
Is the approach used in ccf somehow statistically more correct? I suppose the
strong increasing trend in my test data emphasizes this issue (leaving data
points out from one end has big impact on the average).
Best regards,
Sami Toppinen
[email protected]
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel