Hi Kingsley,

On 17 January 2010 at 02:41, Kingsley G. Morse Jr. wrote:
| 
| Hi Dirk,
| 
| Thanks for maintaining Debian's gsl-bin package.
| 
| Its "gsl-histogram" utility looks like it could be
| a real time saver.
| 
| Unfortunately, it seems to me that I may have
| stumbled upon a circumstance where it drops data.
| 
| Here's how to elicit the bug:
| 
|     $ echo -e "1\n2\n3" | gsl-histogram 1 3 4
| 
| At least on my computer, the resulting output is
| 
|     1 1.5 1
|     1.5 2 0
|     2 2.5 1
|     2.5 3 0
| 
| I'd like to draw your attention to the last line.
| 
| It says no data points are in the last bin, which
| has a maximum value of 3.
| 
| However, you can see that
| 
|     a.) "3" was passed as gsl-histogram's second
|     parameter, which the man page names "xmax",
|     and
| 
|     b.) the original echo statement definitely
|     emitted a "3".
| 
| It seems to me that the data point equal to "3" is
| dropped.
| 
| Perhaps gsl-histogram could be changed from
| checking if data is 
| 
|     "less than" 
|     
| xmax, to checking if data is 
| 
|     "less than or equal to"
| 
| xmax.

Well, it behaves as documented. See e.g. Section 21.1, on the definition of
the histogram struct, in the gsl-ref documentation (available in several
Debian packages):


21.1 The histogram struct
=========================

A histogram is defined by the following struct,

 -- Data Type: gsl_histogram
    `size_t n'
          This is the number of histogram bins

    `double * range'
          The ranges of the bins are stored in an array of N+1 elements
          pointed to by RANGE.

    `double * bin'
          The counts for each bin are stored in an array of N elements
          pointed to by BIN.  The bins are floating-point numbers, so
          you can increment them by non-integer values if necessary.

The range for BIN[i] is given by RANGE[i] to RANGE[i+1].  For n bins
there are n+1 entries in the array RANGE.  Each bin is inclusive at the
lower end and exclusive at the upper end.  Mathematically this means
that the bins are defined by the following inequality,
     bin[i] corresponds to range[i] <= x < range[i+1]

Here is a diagram of the correspondence between ranges and bins on the
number-line for x,


          [ bin[0] )[ bin[1] )[ bin[2] )[ bin[3] )[ bin[4] )
       ---|---------|---------|---------|---------|---------|---  x
        r[0]      r[1]      r[2]      r[3]      r[4]      r[5]

In this picture the values of the RANGE array are denoted by r.  On the
left-hand side of each bin the square bracket `[' denotes an inclusive
lower bound (r <= x), and the round parentheses `)' on the right-hand
side denote an exclusive upper bound (x < r).  Thus any samples which
fall on the upper end of the histogram are excluded.  If you want to
include this value for the last bin you will need to add an extra bin
to your histogram.

   The `gsl_histogram' struct and its associated functions are defined
in the header file `gsl_histogram.h'.



The key is the [ lower, higher ) range, which does not include 'higher'. With
xmax = 3 the last bin is [2.5, 3), so your data point 3 falls just outside it.
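
To make that concrete, here is a minimal sketch in plain Python of equal-width
bins that are inclusive below and exclusive above, applied to the data from
your report. It is not GSL's actual code, and the name `half_open_counts` is
made up for illustration.

```python
def half_open_counts(data, xmin, xmax, n):
    """Count data into n equal-width bins, each of the form [lower, upper)."""
    width = (xmax - xmin) / n
    counts = [0] * n
    for x in data:
        # Anything outside [xmin, xmax) is ignored, including x == xmax.
        if xmin <= x < xmax:
            idx = int((x - xmin) / width)
            # Guard against floating-point rounding pushing idx up to n.
            counts[min(idx, n - 1)] += 1
    return counts


# Data and parameters from the report: xmin=1, xmax=3, 4 bins.
print(half_open_counts([1, 2, 3], 1, 3, 4))    # [1, 0, 1, 0]

# One extra bin of the same width moves the upper edge to 3.5.
print(half_open_counts([1, 2, 3], 1, 3.5, 5))  # [1, 0, 1, 0, 1]
```

The second call is the "add an extra bin" suggestion from the manual text
above: the bin width stays at 0.5, and the value 3 now lands in [3, 3.5). I
would expect 'gsl-histogram 1 3.5 5' to be the command-line equivalent, but I
have not run it here.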

Doing more stats with R than with the GSL myself, I just checked the
documentation 'help(hist)' in R and there I see the inverse:



Details:

     The definition of _histogram_ differs by source (with
     country-specific biases).  R's default with equi-spaced breaks
     (also the default) is to plot the counts in the cells defined by
     ‘breaks’.  Thus the height of a rectangle is proportional to the
     number of points falling into the cell, as is the area _provided_
     the breaks are equally-spaced.

     The default with non-equi-spaced breaks is to give a plot of area
     one, in which the _area_ of the rectangles is the fraction of the
     data points falling in the cells.

     If ‘right = TRUE’ (default), the histogram cells are intervals of
     the form ‘(a, b]’, i.e., they include their right-hand endpoint,
     but not their left one, with the exception of the first cell when
     ‘include.lowest’ is ‘TRUE’.

     For ‘right = FALSE’, the intervals are of the form ‘[a, b)’, and
     ‘include.lowest’ means ‘_include highest_’.



so I presume these choices can be, and probably have been, debated to death.
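
For comparison, here is a sketch of the R rule quoted above for 'right = TRUE',
again in plain Python. The name `right_closed_counts` is made up (it is
neither an R nor a GSL function), and the equal-width breaks are my
assumption to keep the example comparable to your gsl-histogram call.

```python
import math


def right_closed_counts(data, xmin, xmax, n, include_lowest=True):
    """Count data into n equal-width cells, each of the form (lower, upper]."""
    width = (xmax - xmin) / n
    counts = [0] * n
    for x in data:
        if x == xmin:
            # The left edge belongs to the first cell only on request.
            if include_lowest:
                counts[0] += 1
        elif xmin < x <= xmax:
            idx = math.ceil((x - xmin) / width) - 1
            # Clamp to guard against floating-point rounding at the edges.
            counts[min(max(idx, 0), n - 1)] += 1
    return counts


# Same data and bin edges as in the report.
print(right_closed_counts([1, 2, 3], 1, 3, 4))  # [1, 1, 0, 1]
```

Under this convention the value 3 is counted in the last cell (2.5, 3], and
the value 2 moves from the third cell to the second, (1.5, 2].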

As gsl does exactly what its documentation says it will do, we have no bug.
You could start a discussion on the gsl mailing list about allowing an option to
flip this just like R does, but this is clearly outside the scope of a Debian
issue.  So I am closing this one, ok?

Regards, Dirk
 
| Thanks,
| Kingsley
| 
| -- System Information:
| Debian Release: lenny/sid
|   APT prefers unstable
|   APT policy: (990, 'unstable'), (1, 'experimental')
| Architecture: i386 (i686)
| 
| Kernel: Linux 2.6.25-2-686 (SMP w/2 CPU cores)
| Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
| Shell: /bin/sh linked to /bin/bash
| 
| Versions of packages gsl-bin depends on:
| ii  libc6                         2.9-6      GNU C Library: Shared libraries
| ii  libgsl0ldbl                   1.10-1     GNU Scientific Library (GSL) -- li
| 
| gsl-bin recommends no packages.
| 
| gsl-bin suggests no packages.
| 
| -- no debconf information
| 
| 

-- 
Three out of two people have difficulties with fractions.


