Dear Philippe,

while I don't like to quibble about rules of thumb (since they are, as you rightly point out, without foundation in statistical theory), I would like to correct the impression your email gave. Let's take a hypothetical example along the lines you proposed (or at least along the lines of what I understand you proposed).
x <- rnorm(100, mean=37, sd=1) # human body temperature
How could we have zeros here?
If we had a zero (from a dead body in the snow, for example), then the value of c should indeed be the one that transforms the values to Kelvin, i.e. 273, not 1! What a strange idea to argue that, because a body lay dead in the snow, we can assume it has the same temperature as the universe! No, sorry, I prefer to stick to the "VERY complicated" rule of thumb of using half the non-zero minimum.
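(To make the Kelvin point concrete, here is a minimal sketch reusing x from above; the dead-body data point is, of course, made up:)

x[1] <- 0          # a hypothetical dead body in the snow, in °C
log(x + 273)       # log on the (roughly) Kelvin scale, i.e. c = 273
log(x + 1)[1]      # c = 1 instead maps the dead body to log(1) = 0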

In my experience, zeros could arise in the following circumstance: we transplant seedlings into some treatments. Some don't survive, others thrive. So, at the end of the experiment, we have the choice of giving the dead seedlings a weight of 0 or NA. If we choose 0, then we are confounding two processes: survival and growth under treatment conditions. That's what I meant when I wrote that the values come from different processes. We could/should opt for a mixed-distribution approach (e.g. what Zuur et al. refer to as ZAP models, in the book that Gavin mentioned). Or we could choose to transform the 0s to match the biomass of the surviving individuals.

Here comes Philippe's criticism of the two rules of thumb: we can use grams, or milligrams, or tons, and each time the c-value would change. Correct: it would. And rightly so, I think. Why should the value of c be some natural constant (such as 1)? Of course we seek to adjust it to the distribution of the data, because we are actually imputing (in a sense) a value we know cannot be right as it is. Therefore it seems obvious to me that we must have rules of thumb that provide different values for different data sets.

In fact, c=1 is not a constant value. Because we log-transform the data, a c=1 added to a value of 1 is large (turning a log(1)=0 into a log(2)=0.7), while the same c=1 added to 28377 is small (turning a log(28377)=10.25333 into a 10.25337). I think this gives an impression of the distortion we add: negligible at the upper end, substantial at the lower. The second rule of thumb tries to balance that.
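(In R, for instance, using the two values from the paragraph above:)

log(1 + 1) - log(1)           # 0.693: a substantial shift at the lower end
log(28377 + 1) - log(28377)   # 0.0000352: negligible at the upper end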

And, really, these rules of thumb are not "VERY complicated":
for c(342, 234, 132, 1441, 2, 4443, 23434, 0), rule 1 proposes c=1 (half the smallest non-zero value, i.e. 2/2) and rule 2 proposes c=4.5.
quantile() yields:
 0%     25%     50%     75%    100%
 0.0    99.5   288.0  2191.5 23434.0
99.5*99.5/2191.5 = 4.5
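In R, both rules are one-liners (this snippet just redoes the arithmetic above):

x <- c(342, 234, 132, 1441, 2, 4443, 23434, 0)
sort(unique(x))[2]/2               # rule 1: half the smallest non-zero value = 1
quantile(x)[2]^2/quantile(x)[4]    # rule 2: 99.5^2/2191.5, approx. 4.5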

Do you think that is VERY complicated?

Anyone volunteering for a little simulation study, analysing which rule-of-thumb would be best? Meet the candidates:
1. Eliminator (get rid of zeros)
2. Oner (log1p)
3. little-bitter (half of the smallest non-zero value)
4. quantiler (first quartile squared, divided by the third quartile)

set.seed(11111)
x1 <- rlnorm(100, 5, 2)          # lognormal sample
x2 <- rlnorm(100, 5.5, 2)        # lognormal sample with truly larger meanlog
t.test(log(x1), log(x2))         # significant, as it should be
x1[sample(1:100, size=40)] <- 0  # now turn 40 random values per sample into zeros
x2[sample(1:100, size=40)] <- 0

1. Eliminator:
t.test(log(x1[x1>0]), log(x2[x2>0])) # significant
2. Oner:
t.test(log1p(x1), log1p(x2)) # not significant
3. little-bitter:
sort(unique(c(x1,x2)))[2]/2 # approx. 0.6
t.test(log(x1+.6), log(x2+.6)) # not significant
4. quantiler:
quantile(c(x1,x2))
# Ah, now that is interesting: with this many zeros the first quartile is 0, so you cannot use this rule of thumb!
quantile(c(x1,x2)[c(x1,x2)>0])
# (34^2)/541 = 2.1
t.test(log(x1+2.1), log(x2+2.1)) # not significant

I wouldn't go so far as to claim that this is a real test of the contestants; it merely outlines a possible approach for doing so. In any case, none of the rules of thumb recovers the "true" significance, only the eliminator does!
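If anyone does volunteer, here is a minimal sketch of how the contest could be run (the function name sim.once, the 40% zeros, and the 1000 replicates are my ad-hoc choices, not a definitive design):

sim.once <- function() {
  x1 <- rlnorm(100, 5, 2)                    # true difference in meanlog:
  x2 <- rlnorm(100, 5.5, 2)                  # the tests *should* be significant
  x1[sample(100, 40)] <- 0
  x2[sample(100, 40)] <- 0
  c.little <- sort(unique(c(x1, x2)))[2]/2   # half the smallest non-zero value
  q <- quantile(c(x1, x2)[c(x1, x2) > 0])
  c.quant <- unname(q[2]^2/q[4])             # first quartile squared / third quartile
  c(eliminator = t.test(log(x1[x1 > 0]), log(x2[x2 > 0]))$p.value,
    oner       = t.test(log1p(x1), log1p(x2))$p.value,
    littlebit  = t.test(log(x1 + c.little), log(x2 + c.little))$p.value,
    quantiler  = t.test(log(x1 + c.quant), log(x2 + c.quant))$p.value)
}
rowMeans(replicate(1000, sim.once()) < 0.05)  # proportion significant per candidate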

I'll be on holiday for the next few weeks, in order to avoid all further discussions ...

Carsten



Philippe Grosjean wrote:
Carsten Dormann wrote:
Dear Nate,

although I learned from Philippe's response about the existence of log1p, I don't think I will use it (for the reasons given below). Thierry's response is true for Poisson data, but not for non-integer values. Still, it points in an important direction: all too often, zeros emanate from a different process than the other values (see mixed distributions, zero-inflated models, hurdle models and all that). In that case, you should consult Ben Bolker's excellent book (which is probably still available as a draft on his homepage, but is also worth buying).

If you want to transform, here is my take:

My folklore guidelines on the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value: signif(0.5*sort(unique(x))[2], 2)
2. c should be the square of the first quartile divided by the third quartile: (quantile(x)[2]^2)/quantile(x)[4]
For example:
set.seed(11011)
x <- c(runif(95), rep(0,5))

Method 1: c=0.0015
Method 2: c=0.015
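(Both values come straight from the two one-liners above; the signif() rounding on method 2 is added here just for display:)

signif(0.5*sort(unique(x))[2], 2)             # method 1: 0.0015
signif((quantile(x)[2]^2)/quantile(x)[4], 2)  # method 2: 0.015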
While this looks like a huge difference (an order of magnitude), it actually isn't all that much, given the range of the data:

plot(density(x))
abline(v=c(0.0015, 0.015))

These are VERY complicated rules for just an empirical rule of thumb without connection to any theoretical background. Moreover, c depends on the dataset, and thus changes from dataset to dataset, which is NOT a desirable behaviour.

So, provided the maximum values are large (100s or more) and the minimum value above zero is not too small (let's say 1, e.g., for counts), ln(x+1), alias log1p(), is convenient because log1p(0) = 0. So, given that we use rules of thumb, this one looks good because (1) it transforms zero to zero, and (2) the transformation is independent of the content of the data. But I agree it is not a good choice when you deal with small values.
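(A quick illustration of both properties:)

log1p(0)                 # exactly 0: zeros stay zeros
log1p(1000) - log(1000)  # approx. 0.001: next to nothing for large values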

Now, if you want to be more accurate, you have to determine the actual distribution of your data. If your data are (generalized) log-normally distributed, c must be defined according to what you know about the variable you measure. For instance, temperature expressed in °C would require choosing c = 273.15 to be correct... very far away from 1, or from the c that the rules you propose would yield!

Best,

Philippe

I do have a reference for method 2, but it is in German: Stahel, W. A. (2002) Statistische Datenanalyse. Eine Einführung für Naturwissenschaftler. Vieweg, Braunschweig. Method 1 is what my PhD statistics adviser recommended. Since he was right about everything else, I rely on his advice here, too. That may, I acknowledge, not be good enough for you. But maybe someone else can find a proper reference.

The key thing for any value of c is that it doesn't distort the analysis. But then, how do you detect distortion? I used a comparison of rank-transformed data with various values of c. When c was large (in the current example, e.g., 0.5 or so), the analysis started to differ from the rank analysis. To use log1p here would be a dramatic distortion!
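A minimal sketch of what such a check could look like (the two-group setup and all numbers here are hypothetical, not the original analysis):

g <- rep(c("A", "B"), each = 50)
x <- c(rlnorm(50, 1, 1), rlnorm(50, 1.5, 1))   # made-up skewed data
x[sample(100, 10)] <- 0                        # with some zeros
t.test(rank(x) ~ g)$p.value                    # the rank-based benchmark
for (cc in c(0.005, 0.05, 0.5, 1))             # does log(x+c) tell the same story?
  print(c(c = cc, p = t.test(log(x + cc) ~ g)$p.value))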

Another way to look at it is through the Box-Cox transformation. Since Box-Cox transforms towards a symmetric (not necessarily normal) distribution, c, too, should be chosen in such a way as to facilitate the transformation towards symmetry.
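One way to operationalise that, as a sketch in base R (the quartile-based skewness measure and the search interval are my choices), would be to pick the c that makes log(x+c) most symmetric:

qskew <- function(z) {                  # Bowley's quartile skewness: 0 = symmetric
  q <- quantile(z, c(0.25, 0.5, 0.75))
  unname((q[3] + q[1] - 2*q[2])/(q[3] - q[1]))
}
x <- c(runif(95), rep(0, 5))            # the kind of data from the example above
optimize(function(cc) abs(qskew(log(x + cc))), interval = c(1e-6, 1))$minimum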

HTH,

Carsten


Nate Upham wrote:
I have a general stats question for you guys:

How does one normally deal with zero (0) values when log-transforming data? I would like to log-transform (natural log, ln) several response variables for use in quantile regression. But one of my variables includes several zero values. Since ln(0) is undefined (it goes to minus infinity), this is not readily possible. Is it best to remove all data with zero values? Or should I add a very small number to each value (e.g., 0.00001)? This seems problematic. Is there an easy way to address this issue?

Thanks much for your help,
--Nate

_________________________________
Nathan S. Upham
Ph.D. student
Committee on Evolutionary Biology
University of Chicago
1025 E. 57th St., Culver 402
Chicago, IL 60637
nsup...@uchicago.edu





--
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ Permoserstr. 15
04318 Leipzig
Germany

Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dorm...@ufz.de
internet: http://www.ufz.de/index.php?de=4205

