Hello all.

I noticed that the default setting for breaks in the construction of histograms 
in hist() is “right = TRUE”.

I think “right=FALSE” would be more consistent with usual definitions of lower 
and upper limits for bins in applied statistics, and I suggest that you 
consider making it the default for hist().

For example, I generated the following frequency distribution for duration of 
hospitalization with a script in R specifying the cuts to be “right = FALSE” 
(from an exercise in Bernard Rosner’s Fundamentals of Biostatistics book).  

             number    proportion
[0,5)           5         0.20
[5,10)         12         0.48
[10,15)         6         0.24
[15,20)         1         0.04
[20,25)         0         0.00
[25,30]         1         0.04

Since durations are counted in whole days and the limits on the right are 
“open” (with the exception of the last bin), the bins actually contain the 
values 0-4, 5-9, 10-14, and so on. This format agrees with usual practice and 
recommendations. It is also compatible with the process described by Rosner in 
his book (“from y inclusive to y exclusive”).
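
A minimal sketch of how a table like the one above might be produced with 
cut() and table(). The vector "days" is hypothetical: its values were chosen 
only so that the counts match the table, and it is not the original exercise 
data.

```r
# Hypothetical durations in days, constructed only to match the counts
# in the table above; this is NOT the original exercise data.
days <- c(rep(2, 5), rep(5, 4), rep(7, 8), 10, rep(12, 5), 17, 30)

# Left-closed, right-open bins. With right = FALSE, include.lowest = TRUE
# closes the last bin on the right, so the value 30 is not dropped.
bins_left <- cut(days, breaks = seq(0, 30, by = 5),
                 right = FALSE, include.lowest = TRUE)

# Counts and proportions per bin.
freq_left <- table(bins_left)
data.frame(number     = as.vector(freq_left),
           proportion = as.vector(prop.table(freq_left)),
           row.names  = names(freq_left))
```

Note that cut() defaults to include.lowest = FALSE, so without that argument 
the value 30 would fall outside every bin and be returned as NA.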

If I use R to generate a histogram with 6 bins, I get the following:

Attachment: histogram1.pdf
Description: Adobe PDF document

This is actually the histogram of the frequency distribution obtained when the 
“right” parameter is set to “TRUE”: 


             number    proportion
[0,5]           9         0.36
(5,10]          9         0.36
(10,15]         5         0.20
(15,20]         1         0.04
(20,25]         0         0.00
(25,30]         1         0.04

In this case, the bins actually contain the values 0-5, 6-10, 11-15, and so on.

If I edit the histogram command adding “right = FALSE”, I can get the histogram 
for my original frequency distribution. Compare bins 1 and 2 in both 
distributions and histograms.


Attachment: Histogram2.pdf
Description: Adobe PDF document
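
The difference between the two histograms can also be checked without 
plotting. The following is only a sketch: "days" is again a hypothetical 
vector built to match the counts in the two tables, not the original data.

```r
# Hypothetical durations in days, constructed only to match the counts
# in the two tables; this is NOT the original exercise data.
days   <- c(rep(2, 5), rep(5, 4), rep(7, 8), 10, rep(12, 5), 17, 30)
breaks <- seq(0, 30, by = 5)

# Default behaviour of hist(): right = TRUE, bins of the form (a, b].
h_right <- hist(days, breaks = breaks, plot = FALSE)

# Bins of the form [a, b), as in the original frequency distribution.
h_left <- hist(days, breaks = breaks, right = FALSE, plot = FALSE)

# Bin counts side by side; only the first three bins differ.
rbind(right_TRUE = h_right$counts, right_FALSE = h_left$counts)
```

With these values, the observations equal to 5 and 10 are the ones that move 
between neighbouring bins when “right” is changed.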



The value of the “right” parameter may be a matter of preference, but I think 
most users of R would benefit from bins that are closed on the left and open 
on the right, and therefore from having “right = FALSE” as the default for 
hist().

I am aware I am writing from the limited perspective of my own field 
(epidemiology and biostatistics), but there are plenty of examples that show 
the need to consider changing the default. Here are just a few:

https://www.statcan.gc.ca/eng/concepts/definitions/age2

https://seer.cancer.gov/stdpopulations/stdpop.19ages.html

https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-hinc/hinc-01.html


Thank you.

José 

José G. Conde, MD, MPH
Professor, School of Medicine
Director, CentIT2
UPR Medical Sciences Campus 

Tel  (787) 763-9401 Fax (787) 758-5206

Email: jose.con...@upr.edu

URL: http://rcmi.rcm.upr.edu

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
