Your message dated Sun, 17 Jan 2010 06:46:34 -0600
with message-id <[email protected]>
and subject line Re: Bug#565600: gsl-bin: gsl-histogram drops data equal to xmax
has caused the Debian Bug report #565600,
regarding gsl-bin: gsl-histogram drops data equal to xmax
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
565600: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=565600
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: gsl-bin
Version: 1.13+dfsg-1
Severity: normal


Hi Dirk,

Thanks for maintaining Debian's gsl-bin package.

Its "gsl-histogram" utility looks like it could be
a real time saver.

Unfortunately, it seems to me that I may have
stumbled upon a circumstance where it drops data.

Here's how to elicit the bug:

    $ echo -e "1\n2\n3" | gsl-histogram 1 3 4

At least on my computer, the resulting output is

    1 1.5 1
    1.5 2 0
    2 2.5 1
    2.5 3 0

I'd like to draw your attention to the last line.

It says no data points are in the last bin, which
has a maximum value of 3.

However, you can see that

    a.) "3" was passed as gsl-histogram's second
    parameter, which the man page names "xmax",
    and

    b.) the original echo statement definitely
    emitted a "3".

It seems to me that the data point equal to "3" is
dropped.

Perhaps gsl-histogram could be changed from
checking if data is 

    "less than" 
    
xmax, to checking if data is 

    "less than or equal to"

xmax.

Thanks,
Kingsley

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (990, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.25-2-686 (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages gsl-bin depends on:
ii  libc6                         2.9-6      GNU C Library: Shared libraries
ii  libgsl0ldbl                   1.10-1     GNU Scientific Library (GSL) -- li

gsl-bin recommends no packages.

gsl-bin suggests no packages.

-- no debconf information



--- End Message ---
--- Begin Message ---
Hi Kingsley,

On 17 January 2010 at 02:41, Kingsley G. Morse Jr. wrote:
| 
| Hi Dirk,
| 
| Thanks for maintaining Debian's gsl-bin package.
| 
| Its "gsl-histogram" utility looks like it could be
| a real time saver.
| 
| Unfortunately, it seems to me that I may have
| stumbled upon a circumstance where it drops data.
| 
| Here's how to elicit the bug:
| 
|     $ echo -e "1\n2\n3" | gsl-histogram 1 3 4
| 
| At least on my computer, the resulting output is
| 
|     1 1.5 1
|     1.5 2 0
|     2 2.5 1
|     2.5 3 0
| 
| I'd like to draw your attention to the last line.
| 
| It says no data points are in the last bin, which
| has a maximum value of 3.
| 
| However, you can see that
| 
|     a.) "3" was passed as gsl-histogram's second
|     parameter, which the man page names "xmax",
|     and
| 
|     b.) the original echo statement definitely
|     emitted a "3".
| 
| It seems to me that the data point equal to "3" is
| dropped.
| 
| Perhaps gsl-histogram could be changed from
| checking if data is 
| 
|     "less than" 
|     
| xmax, to checking if data is 
| 
|     "less than or equal to"
| 
| xmax.

Well, it behaves as documented; see e.g. Section 21.1 about the definition of
the histogram struct in the gsl-ref documentation (available in several
Debian packages):


21.1 The histogram struct
=========================

A histogram is defined by the following struct,

 -- Data Type: gsl_histogram
    `size_t n'
          This is the number of histogram bins

    `double * range'
          The ranges of the bins are stored in an array of N+1 elements
          pointed to by RANGE.

    `double * bin'
          The counts for each bin are stored in an array of N elements
          pointed to by BIN.  The bins are floating-point numbers, so
          you can increment them by non-integer values if necessary.

The range for BIN[i] is given by RANGE[i] to RANGE[i+1].  For n bins
there are n+1 entries in the array RANGE.  Each bin is inclusive at the
lower end and exclusive at the upper end.  Mathematically this means
that the bins are defined by the following inequality,
     bin[i] corresponds to range[i] <= x < range[i+1]

Here is a diagram of the correspondence between ranges and bins on the
number-line for x,


          [ bin[0] )[ bin[1] )[ bin[2] )[ bin[3] )[ bin[4] )
       ---|---------|---------|---------|---------|---------|---  x
        r[0]      r[1]      r[2]      r[3]      r[4]      r[5]

In this picture the values of the RANGE array are denoted by r.  On the
left-hand side of each bin the square bracket `[' denotes an inclusive
lower bound (r <= x), and the round parentheses `)' on the right-hand
side denote an exclusive upper bound (x < r).  Thus any samples which
fall on the upper end of the histogram are excluded.  If you want to
include this value for the last bin you will need to add an extra bin
to your histogram.

   The `gsl_histogram' struct and its associated functions are defined
in the header file `gsl_histogram.h'.



The key is the half-open range [ lower, higher ), which does not include 'higher'.
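
To make the half-open convention concrete, here is a minimal Python sketch of
the binning rule quoted above. It is not GSL's actual code: the function name
`half_open_histogram` and its argument order are assumptions made for
illustration, chosen to mirror the `xmin xmax n` arguments of the command in
the report. It also tries the documented workaround of extending the range by
one extra bin.

```python
def half_open_histogram(data, xmin, xmax, n):
    """Count samples into n equal-width bins of the form [lo, hi).

    Hypothetical helper that mimics the documented rule
    range[i] <= x < range[i+1]; it does not call GSL.
    """
    width = (xmax - xmin) / n
    counts = [0] * n
    for x in data:
        # A sample equal to xmax fails this test, so it is not counted.
        if xmin <= x < xmax:
            # Guard against floating-point rounding pushing the index to n.
            i = min(int((x - xmin) / width), n - 1)
            counts[i] += 1
    edges = [xmin + i * width for i in range(n + 1)]
    return [(edges[i], edges[i + 1], counts[i]) for i in range(n)]


# Same data and range as in the report: the sample 3 is not counted.
for row in half_open_histogram([1, 2, 3], 1, 3, 4):
    print(*row)

print()

# Workaround from the documentation: add one extra bin, so that 3 falls
# inside [3, 3.5) instead of sitting on the excluded upper edge.
for row in half_open_histogram([1, 2, 3], 1, 3.5, 5):
    print(*row)
```

With the original range the last bin stays empty, as in the report; with the
range extended to 3.5 and five bins, the sample 3 is counted in the new last
bin.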

Doing more stats with R than with the GSL myself, I just checked the
documentation 'help(hist)' in R, and there I see the opposite default:



Details:

     The definition of _histogram_ differs by source (with
     country-specific biases).  R's default with equi-spaced breaks
     (also the default) is to plot the counts in the cells defined by
     ‘breaks’.  Thus the height of a rectangle is proportional to the
     number of points falling into the cell, as is the area _provided_
     the breaks are equally-spaced.

     The default with non-equi-spaced breaks is to give a plot of area
     one, in which the _area_ of the rectangles is the fraction of the
     data points falling in the cells.

     If ‘right = TRUE’ (default), the histogram cells are intervals of
     the form ‘(a, b]’, i.e., they include their right-hand endpoint,
     but not their left one, with the exception of the first cell when
     ‘include.lowest’ is ‘TRUE’.

     For ‘right = FALSE’, the intervals are of the form ‘[a, b)’, and
     ‘include.lowest’ means ‘_include highest_’.



so I presume these choices can be, and probably have been, debated to death.
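
For comparison, the right-closed convention described in the R help text can
be sketched the same way. This is only an illustration of the `(a, b]` rule
with the first cell closed on the left, not R's implementation; the function
name `right_closed_histogram` and the `include_lowest` parameter are
hypothetical names modelled on the wording above.

```python
def right_closed_histogram(data, breaks, include_lowest=True):
    """Count samples into cells of the form (a, b].

    Hypothetical helper: when include_lowest is true, the first cell
    also accepts a sample equal to its left endpoint.
    """
    counts = [0] * (len(breaks) - 1)
    for x in data:
        for i in range(len(counts)):
            lo, hi = breaks[i], breaks[i + 1]
            in_cell = lo < x <= hi
            if i == 0 and include_lowest and x == lo:
                in_cell = True
            if in_cell:
                counts[i] += 1
                break
    return counts


# Same data and bin edges as in the report.
print(right_closed_histogram([1, 2, 3], [1, 1.5, 2, 2.5, 3]))
```

Under this rule the sample 3 lands in the last cell, while the sample 2 moves
from the cell starting at 2 to the cell ending at 2, so neither convention is
free of edge effects.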

As gsl does exactly what its documentation says it will do, we have no bug.
You could start a discussion on the gsl mailing list about allowing an option to
flip this just like R does, but this is clearly outside the scope of a Debian
issue.  So I am closing this one, ok?

Regards, Dirk
 
| Thanks,
| Kingsley
| 
| -- System Information:
| Debian Release: lenny/sid
|   APT prefers unstable
|   APT policy: (990, 'unstable'), (1, 'experimental')
| Architecture: i386 (i686)
| 
| Kernel: Linux 2.6.25-2-686 (SMP w/2 CPU cores)
| Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
| Shell: /bin/sh linked to /bin/bash
| 
| Versions of packages gsl-bin depends on:
| ii  libc6                         2.9-6      GNU C Library: Shared libraries
| ii  libgsl0ldbl                   1.10-1     GNU Scientific Library (GSL) -- li
| 
| gsl-bin recommends no packages.
| 
| gsl-bin suggests no packages.
| 
| -- no debconf information
| 
| 

-- 
Three out of two people have difficulties with fractions.


--- End Message ---
