Your message dated Sun, 17 Jan 2010 06:46:34 -0600
with message-id <[email protected]>
and subject line Re: Bug#565600: gsl-bin: gsl-histogram drops data equal to xmax
has caused the Debian Bug report #565600,
regarding gsl-bin: gsl-histogram drops data equal to xmax
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)

-- 
565600: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=565600
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: gsl-bin
Version: 1.13+dfsg-1
Severity: normal

Hi Dirk,

Thanks for maintaining Debian's gsl-bin package.

Its "gsl-histogram" utility looks like it could be
a real time saver.

Unfortunately, it seems to me that I may have
stumbled upon a circumstance where it drops data.

Here's how to elicit the bug:

    $ echo -e "1\n2\n3" | gsl-histogram 1 3 4

At least on my computer, the resulting output is

    1 1.5 1
    1.5 2 0
    2 2.5 1
    2.5 3 0

I'd like to draw your attention to the last line.

It says no data points are in the last bin, which
has a maximum value of 3.

However, you can see that

    a.) "3" was passed as gsl-histogram's second
        parameter, which the man page names "xmax",
        and

    b.) the original echo statement definitely
        emitted a "3".

It seems to me that the data point equal to "3" is
dropped.

Perhaps gsl-histogram could be changed from
checking if data is

    "less than"

xmax, to checking if data is

    "less than or equal to"

xmax.

Thanks,
Kingsley

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (990, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.25-2-686 (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages gsl-bin depends on:
ii  libc6        2.9-6   GNU C Library: Shared libraries
ii  libgsl0ldbl  1.10-1  GNU Scientific Library (GSL) -- li

gsl-bin recommends no packages.

gsl-bin suggests no packages.

-- no debconf information
--- End Message ---
--- Begin Message ---
Hi Kingsley,

On 17 January 2010 at 02:41, Kingsley G. Morse Jr. wrote:
|
| Hi Dirk,
|
| Thanks for maintaining Debian's gsl-bin package.
|
| Its "gsl-histogram" utility looks like it could be
| a real time saver.
|
| Unfortunately, it seems to me that I may have
| stumbled upon a circumstance where it drops data.
|
| Here's how to elicit the bug:
|
|     $ echo -e "1\n2\n3" | gsl-histogram 1 3 4
|
| At least on my computer, the resulting output is
|
|     1 1.5 1
|     1.5 2 0
|     2 2.5 1
|     2.5 3 0
|
| I'd like to draw your attention to the last line.
|
| It says no data points are in the last bin, which
| has a maximum value of 3.
|
| However, you can see that
|
|     a.) "3" was passed as gsl-histogram's second
|         parameter, which the man page names "xmax",
|         and
|
|     b.) the original echo statement definitely
|         emitted a "3".
|
| It seems to me that the data point equal to "3" is
| dropped.
|
| Perhaps gsl-histogram could be changed from
| checking if data is
|
|     "less than"
|
| xmax, to checking if data is
|
|     "less than or equal to"
|
| xmax.

Well, it behaves as documented, see e.g. Section 21.1 about the definition
of the histogram struct in the gsl-ref documentation (available in several
Debian packages):

   21.1 The histogram struct
   =========================

   A histogram is defined by the following struct,

    -- Data Type: gsl_histogram
       `size_t n'
             This is the number of histogram bins

       `double * range'
             The ranges of the bins are stored in an array of N+1
             elements pointed to by RANGE.

       `double * bin'
             The counts for each bin are stored in an array of N
             elements pointed to by BIN.  The bins are floating-point
             numbers, so you can increment them by non-integer values
             if necessary.

   The range for BIN[i] is given by RANGE[i] to RANGE[i+1].  For n bins
   there are n+1 entries in the array RANGE.  Each bin is inclusive at
   the lower end and exclusive at the upper end.
   Mathematically this means that the bins are defined by the following
   inequality,

        bin[i] corresponds to range[i] <= x < range[i+1]

   Here is a diagram of the correspondence between ranges and bins on
   the number-line for x,

          [ bin[0] )[ bin[1] )[ bin[2] )[ bin[3] )[ bin[4] )
       ---|---------|---------|---------|---------|---------|---  x
        r[0]      r[1]      r[2]      r[3]      r[4]      r[5]

   In this picture the values of the RANGE array are denoted by r.  On
   the left-hand side of each bin the square bracket `[' denotes an
   inclusive lower bound (r <= x), and the round parentheses `)' on the
   right-hand side denote an exclusive upper bound (x < r).  Thus any
   samples which fall on the upper end of the histogram are excluded.
   If you want to include this value for the last bin you will need to
   add an extra bin to your histogram.

   The `gsl_histogram' struct and its associated functions are defined
   in the header file `gsl_histogram.h'.

The key is the [ lower, higher ) range which does not include 'higher'.

Doing more stats with R than with the GSL myself, I just checked the
documentation 'help(hist)' in R and there I see the inverse:

   Details:

        The definition of _histogram_ differs by source (with
        country-specific biases).  R's default with equi-spaced breaks
        (also the default) is to plot the counts in the cells defined
        by ‘breaks’.  Thus the height of a rectangle is proportional to
        the number of points falling into the cell, as is the area
        _provided_ the breaks are equally-spaced.

        The default with non-equi-spaced breaks is to give a plot of
        area one, in which the _area_ of the rectangles is the fraction
        of the data points falling in the cells.

        If ‘right = TRUE’ (default), the histogram cells are intervals
        of the form ‘(a, b]’, i.e., they include their right-hand
        endpoint, but not their left one, with the exception of the
        first cell when ‘include.lowest’ is ‘TRUE’.

        For ‘right = FALSE’, the intervals are of the form ‘[a, b)’,
        and ‘include.lowest’ means ‘_include highest_’.
so I presume these choices can, and probably have been, debated to death.

As gsl does exactly what its documentation says it will do, we have no bug.

You could start a discussion on the gsl mailing list about allowing an
option to flip this just like R does, but this is clearly outside the scope
of a Debian issue.

So I am closing this one, ok?

Regards, Dirk

| Thanks,
| Kingsley
|
| -- System Information:
| Debian Release: lenny/sid
|   APT prefers unstable
|   APT policy: (990, 'unstable'), (1, 'experimental')
| Architecture: i386 (i686)
|
| Kernel: Linux 2.6.25-2-686 (SMP w/2 CPU cores)
| Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
| Shell: /bin/sh linked to /bin/bash
|
| Versions of packages gsl-bin depends on:
| ii  libc6        2.9-6   GNU C Library: Shared libraries
| ii  libgsl0ldbl  1.10-1  GNU Scientific Library (GSL) -- li
|
| gsl-bin recommends no packages.
|
| gsl-bin suggests no packages.
|
| -- no debconf information
|
|

-- 
Three out of two people have difficulties with fractions.
--- End Message ---
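
For reference, the half-open rule quoted from the GSL manual in the reply
(range[i] <= x < range[i+1]) can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not gsl-histogram's actual
code: the function name half_open_counts is made up here, and the bins are
assumed to be equally spaced between xmin and xmax.

```python
from bisect import bisect_right


def half_open_counts(data, xmin, xmax, n):
    """Count data into n equal-width bins, each of the form [lower, upper).

    Hypothetical helper for illustration only; it follows the documented
    rule range[i] <= x < range[i+1] and is not GSL's implementation.
    """
    width = (xmax - xmin) / n
    edges = [xmin + i * width for i in range(n + 1)]  # n + 1 range entries
    counts = [0] * n
    for x in data:
        # bisect_right returns how many edges are <= x, so i is the bin
        # with edges[i] <= x < edges[i + 1] whenever 0 <= i < n.
        i = bisect_right(edges, x) - 1
        if 0 <= i < n:
            counts[i] += 1
        # Values below xmin, or at or above xmax, are skipped.
    return counts


# The case from the report: xmin=1, xmax=3, 4 bins. The 3 is not counted.
print(half_open_counts([1, 2, 3], 1, 3, 4))    # [1, 0, 1, 0]

# The manual's suggestion: one extra bin, so 3 falls into [3, 3.5).
print(half_open_counts([1, 2, 3], 1, 3.5, 5))  # [1, 0, 1, 0, 1]
```

With the same bin width of 0.5, the second call should correspond to
invoking the tool as "gsl-histogram 1 3.5 5", which is the "add an extra
bin" workaround the manual describes.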

