Package: bogofilter
Version: 1.1.3-1
Severity: serious

I report this as "serious" because it _should_ be fixed before Etch is
released. This bug causes bogofilter to work incorrectly on UTF-8 systems
(UTF-8 is Etch's default).

Debian Etch uses UTF-8 locales and charset by default. Bogofilter uses
ISO-8859-1 as its default system charset. This usually causes garbage words
to be written to the user's ~/.bogofilter/wordlist.db, since the default
charset for the _database_ is Unicode/UTF-8.

To reproduce:

Use a UTF-8 locale:

  $ locale
  LANG=fi_FI.UTF-8
  LC_CTYPE="fi_FI.UTF-8"
  LC_NUMERIC="fi_FI.UTF-8"
  LC_TIME="fi_FI.UTF-8"
  LC_COLLATE="fi_FI.UTF-8"
  LC_MONETARY="fi_FI.UTF-8"
  LC_MESSAGES=en_US.UTF-8
  LC_PAPER="fi_FI.UTF-8"
  LC_NAME="fi_FI.UTF-8"
  LC_ADDRESS="fi_FI.UTF-8"
  LC_TELEPHONE="fi_FI.UTF-8"
  LC_MEASUREMENT="fi_FI.UTF-8"
  LC_IDENTIFICATION="fi_FI.UTF-8"
  LC_ALL=

Use Bogofilter's default system charset (or set it to some other 8-bit
charset):

  charset_default=iso-8859-1

Use Bogofilter's default word database charset (Unicode/UTF-8):

  unicode=yes

(These can be defined in /etc/bogofilter.cf or ~/.bogofilter.cf)

Some background information: the letter "ä" is U+00E4 LATIN SMALL LETTER A
WITH DIAERESIS, and in UTF-8 encoding it takes two bytes: $c3 $a4.

  $ mv ~/.bogofilter/wordlist.db ~/wordlist.db-backup
  $ echo "äiti" | bogofilter -n
  $ bogoutil -d ~/.bogofilter/wordlist.db
  head:äiti 0 1 20061213

This example shows that the letter "ä" is encoded _twice_ with UTF-8. The
command "echo" prints the letter "ä" encoded in UTF-8, but Bogofilter assumes
the input is ISO-8859-1 and encodes each of the two bytes separately: $c3
becomes "Ã" (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and $a4 becomes "¤"
(U+00A4 CURRENCY SIGN).

With the lines

  charset_default=utf-8
  unicode=yes

in the /etc/bogofilter.cf file, characters are encoded correctly.

-- System Information:
Debian Release: 4.0
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages bogofilter depends on:
ii  bogofilter-bdb                1.1.3-1    a fast Bayesian spam filter (Berke

bogofilter recommends no packages.

-- no debconf information
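
For reference, the double encoding described above can be reproduced outside Bogofilter. The following is only a minimal Python sketch of the byte-level effect, not Bogofilter's actual code: the function name double_encode is hypothetical, and it assumes the reader maps each input byte to one character of the assumed 8-bit charset before storing the word as UTF-8.

```python
def double_encode(text, assumed_charset="iso-8859-1"):
    """Simulate reading UTF-8 input as an 8-bit charset, then storing it as UTF-8.

    Hypothetical helper for illustration; it only mimics the effect
    described in this report.
    """
    # What "echo" emits in a UTF-8 locale.
    utf8_bytes = text.encode("utf-8")
    # Assumption: every byte is treated as one character of the 8-bit charset.
    misread = utf8_bytes.decode(assumed_charset)
    # The misread characters are then written to the database as UTF-8.
    stored = misread.encode("utf-8")
    return utf8_bytes, misread, stored


word = "\u00e4iti"  # "äiti", written with an escape so the source file is ASCII
utf8_bytes, misread, stored = double_encode(word)

print(utf8_bytes)  # b'\xc3\xa4iti'
print(stored)      # b'\xc3\x83\xc2\xa4iti'

# With the input charset set to UTF-8 the round trip is lossless.
_, _, stored_ok = double_encode(word, assumed_charset="utf-8")
print(stored_ok == utf8_bytes)  # True
```

The two bytes $c3 $a4 turn into four bytes ($c3 $83 $c2 $a4), which is "Ã¤" when displayed as UTF-8. Treating the input as UTF-8 leaves the bytes unchanged, which matches the behaviour seen with charset_default=utf-8.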