Laurent Therond wrote:
>
> [Question]
> How to apply the rules of correlation and linear regression to a data
> set constituting a discrete time series?
>
> In other words, given a data set such as the following:
>
> 2002/1/1, 4500.63
> 2002/1/2, 10255.36
> 2002/1/8, 6530.63
> 2002/1/9, 5230.36
> ...
> ...
> ...
>
> How to determine the regression line describing the overall trend of
> the time series?
>
> [Difficulty]
> Correlation coefficient, slope, intercept and standard deviation of
> the residuals are easy to calculate when both variables (X and Y) are
> numerical values from a continuous interval.
> Yet, how should I handle dates such as "2002/1/9"? How should I
> convert them to numerical values without altering the fundamental
> information they contain and represent?
> If I have 2 time series resulting from the observation of a same
> phenomenon, how should I build their respective regression line so
> that I may compare them? (granted that scales of measurements could
> be different and that dates at which those measurements are taken
> could be different)
>
> [Thoughts]
> I am trying to solve this problem programmatically.
> So, in a first attempt, I converted all dates to their number of
> millisecond since the epoch (January 1, 1970). I guess this should
> work.
> Yet, I came to doubt that solution because it seems arbitrary to pick
> an origin in such a fashion.
> So, I did a search on that topic and I was told to be able to compare
> 2 resulting regression lines I would have to standardize my data by
> using Z-scores instead of the raw values '2002/1/9' and '5230.36'. Is
> that true? (Yet, this advice does not tell me how to obtain a valid
> numerical representation of '2002/1/9' I could use to compute the
> Z-score of that date.)
> Brief, I am confused.
>
> PS: I am not interested in inferential statistics or forecasting. I am
> only interested in *describing* past data.
If your data are taken at intervals of days or hours, I'd convert the dates to Julian dates (I can give you a formula if you need it), perhaps subtracting the earliest one just to keep the numbers smaller. Unless your data are taken at millisecond intervals, I don't think milliseconds are the appropriate unit. You could also use decimal years, but I think that doesn't handle leap years as neatly.

BTW, the American Association of Variable Star Observers records all its data in terms of Julian date (and requires its observers to report them that way, as I recall), so there is a real-world example of time series data handled that way.

Picking some date as your zero point in time is not a real problem. Unless there is some theoretical reason that your computation should be referenced to the moment of the Big Bang, there is always some arbitrariness in dating. Moving the zero point changes only the intercept of the regression line; the slope, the correlation coefficient and the residuals stay the same. Just make sure you're using the same zero point and the same units for both series, and you should be able to directly compare regressions, etc.

The comment about Z-values confuses me, too.

Hope that helps.

Regards,
Russell

--
All too often the study of data requires care.
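
A minimal Python sketch of the day-count approach described above, using only the four sample rows quoted in the question. The names `JDN_OFFSET`, `julian_day_number`, `parse_date` and `fit_line` are illustrative, not from any library, and the conversion assumes daily data with no time of day, so an integer Julian Day Number is enough. It fits the line twice, once on days since the first observation and once on raw Julian Day Numbers, to show what the choice of zero point does and does not affect.

```python
from datetime import date
from math import sqrt
from statistics import mean

# date.toordinal() counts days from 0001-01-01 (= 1) in the proleptic
# Gregorian calendar; adding this offset gives the Julian Day Number
# (for example, 2000-01-01 -> 2451545).
JDN_OFFSET = 1721425


def julian_day_number(d):
    """Integer Julian Day Number of a calendar date (time of day ignored)."""
    return d.toordinal() + JDN_OFFSET


def parse_date(text):
    """Parse a 'YYYY/M/D' string such as '2002/1/9'."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)


def fit_line(xs, ys):
    """Ordinary least squares of y on x: returns (slope, intercept, r)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / sqrt(sxx * syy)


# The four sample rows quoted in the question.
rows = [("2002/1/1", 4500.63), ("2002/1/2", 10255.36),
        ("2002/1/8", 6530.63), ("2002/1/9", 5230.36)]

jdn = [julian_day_number(parse_date(d)) for d, _ in rows]
values = [v for _, v in rows]

# Subtract the earliest date so x is "days since the first observation".
days = [j - min(jdn) for j in jdn]

slope, intercept, r = fit_line(days, values)
slope_raw, intercept_raw, r_raw = fit_line(jdn, values)

print("days since first observation:", days)
print("slope (units per day):", slope)
print("intercept at the first observation:", intercept)
print("correlation coefficient:", r)
# Moving the zero point changes only the intercept.
print("slope and r with raw Julian Day Numbers:", slope_raw, r_raw)
```

To compare two series this way, both would need the same zero point (the same date subtracted from each) and the same time unit, as noted above; with only four points the fitted numbers here are purely illustrative.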
