Dear all,

I need to compute percentage changes of my data, but unfortunately they contain both negative and zero values, and I am quite confused about how to proceed. Searching the internet, I found that many people have run into similar issues, with no obvious solution available.
Over the last couple of weeks I've been trying all the data transformations that I could think of. Below I illustrate the issues encountered on a dummy example:

> x$var
 [1]  0.43 -0.79  0.69  0.76  0.00 -1.51 -0.71  0.80  1.17  1.58  1.48 -1.83 -0.88  1.44 -0.72 -0.22  1.89 -1.27 -0.76
[20]  1.33

- raw data: percentage variations of the original data (which contain negative and zero values) get messed up when passing from a negative to a positive value, and around the value 0.
> x[, "raw"] <- c(NA, diff(x$var) / x[1:19,"var"])

- raw data with abs denominator: compared to the above, this improves the handling of the signs, but it still fails around zero and in some cases gives unexpected results (see [1]).
> x[, "raw abs"] <- c(NA, diff(x$var) / abs(x[1:19,"var"]))

- raw data + constant: add a constant to the data to make them strictly positive, then compute the deltas. This solves the negative and zero value problems, but I am not sure whether it introduces some bias along the way.
> x[, "raw +cst"] <- c(NA, diff((2 + x$var)) / (2 + x[1:19,"var"]))

- log, car::box.cox.powers: both transformations involve adding a constant to the original data.
> x[, "log"] <- c(NA, diff(log(2 + x$var)) / log(2 + x[1:19,"var"]))
> require(car)
> x1 <- box.cox.powers(2 + x$var); x1$lambda
> x[, "box cox"] <- c(NA, diff(box.cox(2 + x$var, x1$lambda)) / box.cox(2 + x[1:19,"var"], x1$lambda))

- sqrt: very similar to the above, but the results are a bit different (and apparently better).
> x[, "sqrt"] <- c(NA, diff(sqrt(2 + x$var)) / sqrt(2 + x[1:19,"var"]))

- exp: the exponential transformation introduces too much variability and distributes it unevenly (my actual data contain values bigger than 5), so the variations can quickly reach astronomical levels.
> x[, "exp"] <- c(NA, diff(exp(x$var)) / exp(x[1:19,"var"]))

- atan transformation: this is a home-grown solution, which ensures that values from -Inf to +Inf are mapped to the interval between 0 and pi. Again, I am not sure what bias this might introduce.
> mytan <- function(x) .5*pi + atan(x)
> x[, "mytan"] <- c(NA, diff(mytan(x$var)) / mytan(x[1:19,"var"]))

The resulting data frame:

> round(x, 3)
     var    raw raw abs raw +cst    log   sqrt box cox    exp  mytan
 1  0.43     NA      NA       NA     NA     NA      NA     NA     NA
 2 -0.79 -2.837  -2.837   -0.502 -0.785 -0.294  -0.840 -0.705 -0.544
 3  0.69 -1.873   1.873    1.223  4.191  0.491   6.289  3.393  1.411
 4  0.76  0.101   0.101    0.026  0.026  0.013   0.038  0.073  0.021
 5  0.00 -1.000  -1.000   -0.275 -0.317 -0.149  -0.407 -0.532 -0.293
 6 -1.51   -Inf    -Inf   -0.755 -2.029 -0.505  -1.591 -0.779 -0.628
 7 -0.71 -0.530   0.530    1.633 -1.357  0.623  -1.517  1.226  0.630
 8  0.80 -2.127   2.127    1.171  3.043  0.473   4.631  3.527  1.355
 9  1.17  0.462   0.462    0.132  0.121  0.064   0.185  0.448  0.084
10  1.58  0.350   0.350    0.129  0.105  0.063   0.169  0.507  0.059
11  1.48 -0.063  -0.063   -0.028 -0.022 -0.014  -0.035 -0.095 -0.012
12 -1.83 -2.236  -2.236   -0.951 -2.421 -0.779  -1.450 -0.963 -0.804
13 -0.88 -0.519   0.519    5.588 -1.064  1.567  -1.124  1.586  0.698
14  1.44 -2.636   2.636    2.071  9.902  0.753  16.643  9.176  1.985
15 -0.72 -1.500  -1.500   -0.628 -0.800 -0.390  -0.870 -0.885 -0.626
16 -0.22 -0.694   0.694    0.391  1.336  0.179   1.679  0.649  0.430
17  1.89 -9.591   9.591    1.185  1.356  0.478   2.333  7.248  0.960
18 -1.27 -1.672  -1.672   -0.812 -1.232 -0.567  -1.115 -0.958 -0.749
19 -0.76 -0.402   0.402    0.699 -1.684  0.303  -1.841  0.665  0.381
20  1.33 -2.750   2.750    1.685  4.592  0.639   7.559  7.085  1.711

As you can see, I am quite unsure about how to proceed. My actual data are financial EPS (earnings per share) forecasts, ranging from -1 to 5, so they have a "natural zero point" (see David Winsemius' comments in [2]). However, I need to compute percentage variations, since I am primarily interested in the evolution of the forecasts for a given company, while EPS data for two different companies are not necessarily comparable. The percentage data would subsequently be used in statistical analyses (regression, etc.).
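For anyone who wants to reproduce the comparisons above, here is a minimal, self-contained sketch. It assumes x is a data frame with the single column var. The helper pct_change() and the variable shift are hypothetical names that do not appear in the code above, and shift = 2 is simply the constant already used there, not a recommended value.

```r
# Dummy data from the example above
x <- data.frame(var = c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80,
                        1.17, 1.58, 1.48, -1.83, -0.88, 1.44, -0.72, -0.22,
                        1.89, -1.27, -0.76, 1.33))

# Hypothetical helper (not part of base R or car):
# relative change of f(v) between consecutive observations.
#   f     - transformation applied to the data before differencing
#   denom - function applied to the previous value used as denominator
pct_change <- function(v, f = identity, denom = identity) {
  tv <- f(v)
  # first observation has no predecessor, hence the leading NA
  c(NA, diff(tv) / denom(head(tv, -1)))
}

# Assumed constant; it must exceed -min(x$var) so that the
# shifted data are strictly positive.
shift <- 2

x[, "raw"]      <- pct_change(x$var)
x[, "raw abs"]  <- pct_change(x$var, denom = abs)
x[, "raw +cst"] <- pct_change(x$var, f = function(v) shift + v)
x[, "sqrt"]     <- pct_change(x$var, f = function(v) sqrt(shift + v))

round(x, 3)
```

Any other transformation can be tried the same way by passing it as f, for instance pct_change(x$var, f = mytan) with mytan as defined above.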
Please advise,
Liviu

[1] http://sci.tech-archive.net/Archive/sci.stat.math/2006-04/msg00544.html
[2] http://sci.tech-archive.net/Archive/sci.stat.math/2006-04/msg00548.html