Package: bogofilter
Version: 1.1.3-1
Severity: serious

I report this as "serious" because it _should_ be fixed before Etch is
released. This bug causes bogofilter to work incorrectly on UTF-8
systems (UTF-8 is Etch's default).

Debian Etch uses UTF-8 locales by default, but Bogofilter assumes
ISO-8859-1 as the default charset of its input. Since the default
charset for the word _database_ is Unicode/UTF-8, the input is converted
as if it were ISO-8859-1, which usually puts garbage words into the
user's ~/.bogofilter/wordlist.db.

To reproduce:


Use UTF-8 locale:

$ locale
LANG=fi_FI.UTF-8
LC_CTYPE="fi_FI.UTF-8"
LC_NUMERIC="fi_FI.UTF-8"
LC_TIME="fi_FI.UTF-8"
LC_COLLATE="fi_FI.UTF-8"
LC_MONETARY="fi_FI.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="fi_FI.UTF-8"
LC_NAME="fi_FI.UTF-8"
LC_ADDRESS="fi_FI.UTF-8"
LC_TELEPHONE="fi_FI.UTF-8"
LC_MEASUREMENT="fi_FI.UTF-8"
LC_IDENTIFICATION="fi_FI.UTF-8"
LC_ALL=


Use Bogofilter's default system charset (or set it to some
other 8-bit charset):
  charset_default=iso-8859-1
Use Bogofilter's default word database charset (Unicode/UTF-8):
  unicode=yes

(These can be defined in /etc/bogofilter.cf or ~/.bogofilter.cf)


Some background information: the letter "ä" is U+00E4 LATIN SMALL LETTER
A WITH DIAERESIS, and in UTF-8 encoding it takes two bytes: $c3 $a4.
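
The byte values above can be checked with a few lines of Python. This is
only an illustration of the encoding itself; it does not involve
Bogofilter, and the variable names are made up for the example.

```python
# Illustration only: show the code point and the UTF-8 bytes of "ä".
letter = "\u00e4"  # ä, LATIN SMALL LETTER A WITH DIAERESIS

utf8_bytes = letter.encode("utf-8")
hex_bytes = " ".join(f"{b:02x}" for b in utf8_bytes)

# Prints the code point followed by the two UTF-8 bytes: U+00E4 c3 a4
print(f"U+{ord(letter):04X}", hex_bytes)
```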


$ mv ~/.bogofilter/wordlist.db ~/wordlist.db-backup
$ echo "äiti" | bogofilter -n
$ bogoutil -d ~/.bogofilter/wordlist.db
head:äiti 0 1 20061213

This example shows that the letter "ä" is encoded _twice_ with UTF-8.
The command "echo" prints the letter "ä" encoded in UTF-8, but Bogofilter
thinks it is in ISO-8859-1 and encodes each byte separately: $c3
becomes "Ã" (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and $a4 becomes
"¤" (U+00A4 CURRENCY SIGN).
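
The double encoding described above can be reproduced outside Bogofilter
with a short Python sketch. It only mimics the effect (UTF-8 bytes read
as ISO-8859-1 and then stored as UTF-8); it is not Bogofilter's actual
code, and the function name misread_as is hypothetical.

```python
# Sketch of the effect only, not Bogofilter's implementation.
def misread_as(text: str, assumed_charset: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    raw = text.encode("utf-8")          # what "echo" writes in a UTF-8 locale
    return raw.decode(assumed_charset)  # each byte is taken as one character

word = "\u00e4iti"                      # "äiti"
misread = misread_as(word)              # "Ã¤iti": two characters instead of "ä"
stored = misread.encode("utf-8")        # bytes as they would end up in a UTF-8 database

print(misread)
print(" ".join(f"{b:02x}" for b in stored))  # c3 83 c2 a4 69 74 69
```

With the assumed charset set to "utf-8" instead, misread_as returns the
word unchanged, which matches the behaviour seen with
charset_default=utf-8.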

With the lines
  charset_default=utf-8
  unicode=yes
in /etc/bogofilter.cf, characters are encoded correctly.


-- System Information:
Debian Release: 4.0
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages bogofilter depends on:
ii  bogofilter-bdb                1.1.3-1    a fast Bayesian spam filter (Berke

bogofilter recommends no packages.

-- no debconf information
