Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Clint Adams wrote (26.12.2006 at 19.26):

> charset_default sets the charset assumed for messages without proper
> headers. I have seen no instances of mail in the wild where the charset
> was unspecified yet was actually proper UTF-8.

I didn't know that bogofilter is able to check message headers for the correct encoding. I use KMail (KDE's email client), and it converts messages to the locale's charset before sending them to bogofilter. How do other programs behave? What is the correct behaviour (if there is one)?

If this is just KMail's problem, namely that the bogofilter database gets (practically) corrupted when the locale's charset differs from bogofilter's charset_default, then yes, this is not a bogofilter bug.

-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Clint Adams wrote (24.12.2006 at 9.23):

>> Having lines
>>
>> charset_default=utf-8
>> unicode=yes
>
> Isn't unicode=yes already the default?

Yes, it is the default. I think it's a good idea to define unicode=yes explicitly because defaults may change (although in this case I don't believe it will).
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Teemu Likonen wrote on Wed 27 Dec 2006 10:23:39 +0200:

> I didn't know that bogofilter is able to check message headers for
> correct encoding. I use KMail (KDE's email client) and it converts
> messages to locale charset before sending them to bogofilter. How do
> other programs behave? What is the correct behaviour (if there is one)?

I'd say the correct behavior is to just keep the message intact.

Samuel
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Samuel Thibault wrote (27.12.2006 at 10.49):

> Teemu Likonen wrote on Wed 27 Dec 2006 10:23:39 +0200:
>> I didn't know that bogofilter is able to check message headers for
>> correct encoding. I use KMail (KDE's email client) and it converts
>> messages to locale charset before sending them to bogofilter. How do
>> other programs behave? What is the correct behaviour (if there is one)?
>
> I'd say the correct behavior is to just keep the message intact.

I checked how bogofilter handles messages with different encodings and Content-Type headers. Bogofilter works as it should: it checks the message's Content-Type header and gets the charset from there. With unicode=yes (which is the default), bogofilter converts the message to UTF-8 and stores the words in its database.

If the charset is not defined in the message's Content-Type header, bogofilter uses its own charset_default setting (the default is ISO-8859-1). I think ISO-8859-1 is a good default: I believe most messages without Content-Type headers are in some kind of Western European charset. Most spam is probably in English.

So my bug report was pretty pointless from bogofilter's point of view. :) I guess this bug can be closed. At the least, I have downgraded the severity to normal.

The KMail problem remains, though. Maybe it's worth filing a new report.
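The charset selection described in this message can be sketched in Python. This is only an illustration of the described behaviour, not bogofilter's actual code (bogofilter is written in C); the function name `decode_body` and the constant `CHARSET_DEFAULT` are hypothetical names chosen for the example.

```python
from email import message_from_bytes

# Assumption: mirrors the charset_default setting discussed in the thread.
CHARSET_DEFAULT = "iso-8859-1"


def decode_body(raw: bytes, charset_default: str = CHARSET_DEFAULT) -> str:
    """Decode a single-part message body to text.

    The charset declared in Content-Type wins; if none is declared,
    fall back to charset_default.
    """
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset() or charset_default
    payload = msg.get_payload(decode=True) or b""
    return payload.decode(charset, errors="replace")


# The same UTF-8 bytes, once with and once without a declared charset.
declared = b"Content-Type: text/plain; charset=utf-8\n\n\xc3\xa4iti\n"
undeclared = b"Subject: test\n\n\xc3\xa4iti\n"

print(decode_body(declared).strip())
print(decode_body(undeclared).strip())
```

With the header present the word is read as intended; without it, the two UTF-8 bytes of ä are each interpreted as a separate ISO-8859-1 character.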
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Teemu Likonen wrote on Thu 28 Dec 2006 00:16:20 +0200:

> If charset is not defined in message's Content-Type headers, bogofilter
> uses its own charset_default setting (default is ISO-8859-1). I think
> ISO-8859-1 is a good default: I believe most of the messages without
> Content-Type headers are in some kind of Western European charset.
> Probably most of the spam is English.

Maybe cp1252 would be even more useful, since it is a superset of iso-8859-1 and is used by a lot of mailers running on another well-known OS.

Samuel
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Samuel Thibault wrote (27.12.2006 at 23.25):

> Maybe cp1252 would be even more useful, since it is a superset of
> iso-8859-1 and is used by a lot of mailers running on another
> well-known OS.

Indeed. Then it would be charset_default=Windows-1252 or charset_default=cp1252.
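A short Python sketch of where the two charsets differ, using Python's own codecs rather than anything from bogofilter. The sample bytes are made up for illustration. The printable characters of ISO-8859-1 (0xA0 to 0xFF) decode identically in cp1252; the difference lies in 0x80 to 0x9F, where ISO-8859-1 has control characters and cp1252 has typographic quotes, the euro sign and similar.

```python
# Hypothetical sample: curly quotes (0x93, 0x94) and a euro sign (0x80),
# as a Windows mailer might send them without declaring a charset.
raw = b"\x93free\x94 offer \x80100"

as_latin1 = raw.decode("iso-8859-1")  # yields invisible C1 control characters
as_cp1252 = raw.decode("cp1252")      # yields the intended punctuation

# Bytes 0xA0-0xFF map to the same characters in both charsets.
high = bytes(range(0xA0, 0x100))
same_high_range = high.decode("iso-8859-1") == high.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
print(same_high_range)
```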
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Clint Adams wrote on Sun 24 Dec 2006 09:23:44 -0500:

>> Having lines
>>
>> charset_default=utf-8
>> unicode=yes
>
> Isn't unicode=yes already the default?

Nope, iso-8859-1 is (see configure.ac). But it could be, by applying the attached patch (yes, I had to fix the configure.ac script).

Actually, though, text tools should rather use the current locale's charset (from nl_langinfo(CODESET)) instead of hardcoding it in configuration files...

Samuel

diff -ur bogofilter-1.1.3/configure bogofilter-1.1.3-mine/configure
--- bogofilter-1.1.3/configure 2006-12-03 05:17:15.0 +0100
+++ bogofilter-1.1.3-mine/configure 2006-12-27 01:06:32.0 +0100
@@ -6137,6 +6137,7 @@
 #define DEFAULT_CHARSET $withval
 _ACEOF
+ DEFAULT_CHARSET=$withval
 fi
diff -ur bogofilter-1.1.3/configure.ac bogofilter-1.1.3-mine/configure.ac
--- bogofilter-1.1.3/configure.ac 2006-12-03 04:55:30.0 +0100
+++ bogofilter-1.1.3-mine/configure.ac 2006-12-27 01:05:28.0 +0100
@@ -336,6 +336,7 @@
 AC_DEFINE_UNQUOTED(DEFAULT_CHARSET, [$withval],
 [Use specified default charset instead of iso-8859-1])
+ [DEFAULT_CHARSET=$withval]
 )
 AC_SUBST(ENCODING)
Only in bogofilter-1.1.3-mine: configure-stamp
diff -ur bogofilter-1.1.3/debian/rules bogofilter-1.1.3-mine/debian/rules
--- bogofilter-1.1.3/debian/rules 2006-12-27 01:05:50.0 +0100
+++ bogofilter-1.1.3-mine/debian/rules 2006-12-27 01:09:19.0 +0100
@@ -26,11 +26,11 @@
 $(INSTALL) -d obj-db obj-qdbm obj-sqlite
- cd obj-db CFLAGS=$(CFLAGS) ../configure --with-database=db \
+ cd obj-db CFLAGS=$(CFLAGS) ../configure --with-database=db --with-charset=utf-8 \
 --prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc
- cd obj-qdbm CPPFLAGS=-I/usr/include/qdbm CFLAGS=$(CFLAGS) ../configure --with-database=qdbm --program-suffix=-qdbm \
+ cd obj-qdbm CPPFLAGS=-I/usr/include/qdbm CFLAGS=$(CFLAGS) ../configure --with-database=qdbm --program-suffix=-qdbm --with-charset=utf-8 \
 --prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc
- cd obj-sqlite CFLAGS=$(CFLAGS) ../configure --with-database=sqlite --program-suffix=-sqlite \
+ cd obj-sqlite CFLAGS=$(CFLAGS) ../configure --with-database=sqlite --program-suffix=-sqlite --with-charset=utf-8 \
 --prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc \
 sed -i 's/^INTEGRITY_TESTS.*/INTEGRITY_TESTS=t.lock1/' src/tests/Makefile
Only in bogofilter-1.1.3-mine: obj-db
Only in bogofilter-1.1.3-mine: obj-qdbm
Only in bogofilter-1.1.3-mine: obj-sqlite
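The nl_langinfo(CODESET) suggestion above can be illustrated with Python's wrapper around the same C function. This is a minimal sketch of querying the locale's charset, not a proposed change to bogofilter; the helper name `locale_charset` is hypothetical, and `locale.nl_langinfo` is only available on Unix-like systems.

```python
import locale


def locale_charset() -> str:
    """Return the charset name of the current locale, e.g. 'UTF-8'."""
    try:
        # Adopt the user's locale from the environment (LANG, LC_CTYPE, ...).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown or unsupported locale: stay in the default "C" locale.
        pass
    return locale.nl_langinfo(locale.CODESET)


print(locale_charset())
```

Under a locale such as fi_FI.UTF-8 this reports UTF-8, which is the value a tool could use instead of a hardcoded default.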
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
>>> charset_default=utf-8
>>> unicode=yes
>>
>> Isn't unicode=yes already the default?
>
> Nope, iso-8859-1 is (see configure.ac). But it could be by applying the
> attached patch (yes, I had to fix the configure.ac script).
>
> But actually, text tools should rather use the current locale's charset
> (from nl_langinfo(CODESET)), instead of hardcoding it in configuration
> files...

We are talking about two different things.

unicode=yes/no sets the charset used in the database.

charset_default sets the charset assumed for messages without proper headers. I have seen no instances of mail in the wild where the charset was unspecified yet was actually proper UTF-8.
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Clint Adams wrote on Tue 26 Dec 2006 19:26:48 -0500:

>>>> charset_default=utf-8
>>>> unicode=yes
>>>
>>> Isn't unicode=yes already the default?
>>
>> Nope, iso-8859-1 is (see configure.ac). But it could be by applying the
>> attached patch (yes, I had to fix the configure.ac script).
>>
>> But actually, text tools should rather use the current locale's charset
>> (from nl_langinfo(CODESET)), instead of hardcoding it in configuration
>> files...
>
> We are talking about two different things.
>
> unicode=yes/no sets the charset used in the database.

Ah, sorry. Yes, unicode is the default.

> charset_default sets the charset assumed for messages without proper
> headers. I have seen no instances of mail in the wild where the charset
> was unspecified yet was actually proper UTF-8.

Ah, OK, sorry. Then the bug is probably not valid, I guess.

Samuel
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
> Having lines
>
> charset_default=utf-8
> unicode=yes

Isn't unicode=yes already the default?
Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default
Package: bogofilter
Version: 1.1.3-1
Severity: serious

I am reporting this as serious because it _should_ be fixed before Etch is released. This bug causes bogofilter to work incorrectly on UTF-8 systems (UTF-8 is Etch's default).

Debian Etch uses UTF-8 locales and charset by default. Bogofilter uses ISO-8859-1 as the system default. This usually puts garbage words into the user's ~/.bogofilter/wordlist.db, since the default charset for the _database_ is Unicode/UTF-8.

To reproduce:

Use a UTF-8 locale:

$ locale
LANG=fi_FI.UTF-8
LC_CTYPE=fi_FI.UTF-8
LC_NUMERIC=fi_FI.UTF-8
LC_TIME=fi_FI.UTF-8
LC_COLLATE=fi_FI.UTF-8
LC_MONETARY=fi_FI.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=fi_FI.UTF-8
LC_NAME=fi_FI.UTF-8
LC_ADDRESS=fi_FI.UTF-8
LC_TELEPHONE=fi_FI.UTF-8
LC_MEASUREMENT=fi_FI.UTF-8
LC_IDENTIFICATION=fi_FI.UTF-8
LC_ALL=

Use Bogofilter's default system charset (or set it to some other 8-bit charset):

charset_default=iso-8859-1

Use Bogofilter's default word database charset (Unicode/UTF-8):

unicode=yes

(These can be defined in /etc/bogofilter.cf or ~/.bogofilter.cf.)

Some background information: the letter ä is U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, and in UTF-8 encoding it takes two bytes: $c3 $a4.

$ mv ~/.bogofilter/wordlist.db ~/wordlist.db-backup
$ echo äiti | bogofilter -n
$ bogoutil -d ~/.bogofilter/wordlist.db
head:äiti 0 1 20061213

This example shows that the letter ä is encoded _twice_ with UTF-8. The echo command prints the letter ä encoded in UTF-8; Bogofilter assumes the input is ISO-8859-1 and encodes both bytes separately: $c3 becomes Ã (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and $a4 becomes ¤ (U+00A4 CURRENCY SIGN).

With the lines

charset_default=utf-8
unicode=yes

in the /etc/bogofilter.cf file, characters are encoded correctly.

-- System Information:
Debian Release: 4.0
APT prefers testing
APT policy: (900, 'testing')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages bogofilter depends on:
ii bogofilter-bdb 1.1.3-1 a fast Bayesian spam filter (Berke

bogofilter recommends no packages.

-- no debconf information
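
The double encoding described in this report can be reproduced byte for byte in Python, independently of bogofilter. The variable names are illustrative only; the sketch simply re-enacts the steps from the report: UTF-8 input, misread as ISO-8859-1, then stored as UTF-8.

```python
word = "äiti"

# Step 1: in a UTF-8 locale, echo emits the word as UTF-8 bytes.
utf8_bytes = word.encode("utf-8")

# Step 2: the reader assumes ISO-8859-1, so each byte becomes one character.
misread = utf8_bytes.decode("iso-8859-1")

# Step 3: the misread text is converted to UTF-8 for storage.
stored = misread.encode("utf-8")

print(utf8_bytes.hex(" "))  # c3 a4 69 74 69
print(stored.hex(" "))      # c3 83 c2 a4 69 74 69

# Reversing steps 3 and 2 recovers the original word.
repaired = stored.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == word)
```

The two bytes of ä ($c3 $a4) end up as four bytes ($c3 $83 $c2 $a4), which is why the stored token no longer matches the same word read from a message that declares its charset properly.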