Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-27 Thread Teemu Likonen
Clint Adams wrote (26.12.2006 at 19:26):

 charset_default sets the charset assumed for messages without proper
 headers.  I have seen no instances of mail in the wild where the
 charset was unspecified yet was actually proper UTF-8.

I didn't know that bogofilter is able to check message headers for
correct encoding. I use KMail (KDE's email client) and it converts
messages to locale charset before sending them to bogofilter. How do
other programs behave? What is the correct behaviour (if there is one)?

If the only problem is that KMail's conversion (practically) corrupts
the bogofilter database when the locale's charset differs from
bogofilter's charset_default, then, yes, this is not a bogofilter bug.


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-27 Thread Teemu Likonen
Clint Adams wrote (24.12.2006 at 9:23):

  Having lines
charset_default=utf-8
unicode=yes
 
 Isn't unicode=yes already the default?

Yes, it is the default. I think it's a good idea to define unicode=yes
explicitly because defaults may change (in this case, I don't believe it
will, though).





Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-27 Thread Samuel Thibault
Teemu Likonen, on Wed 27 Dec 2006 10:23:39 +0200, wrote:
 I didn't know that bogofilter is able to check message headers for
 correct encoding. I use KMail (KDE's email client) and it converts
 messages to locale charset before sending them to bogofilter. How do
 other programs behave? What is the correct behaviour (if there is one)?

I'd say the correct behavior is to just keep the message intact.

Samuel



Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-27 Thread Teemu Likonen
Samuel Thibault wrote (27.12.2006 at 10:49):

 Teemu Likonen, on Wed 27 Dec 2006 10:23:39 +0200, wrote:
  I didn't know that bogofilter is able to check message headers for
  correct encoding. I use KMail (KDE's email client) and it converts
  messages to locale charset before sending them to bogofilter. How do
  other programs behave? What is the correct behaviour (if there is
  one)?
 
 I'd say the correct behavior is to just keep the message intact.

I checked how bogofilter handles messages with different encodings
and Content-Type headers. Bogofilter works as it should: it checks the
message's Content-Type header and gets the charset from there. With
unicode=yes (which is the default) bogofilter converts the message to
UTF-8 and stores the words in its database.

If the charset is not defined in the message's Content-Type header,
bogofilter uses its own charset_default setting (default is ISO-8859-1).
I think ISO-8859-1 is a good default: I believe most messages without
a charset in their Content-Type header are in some kind of Western
European charset. Most spam is probably in English.
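
The lookup order described above (declared charset first, charset_default
as the fallback, then conversion to UTF-8) can be sketched in Python. This
is only an illustration built on the standard email module, not
bogofilter's actual code; the function name to_utf8 and the constant
CHARSET_DEFAULT are made up for the example.

```python
from email import message_from_bytes

# Assumption: mirrors the charset_default value discussed in this thread.
CHARSET_DEFAULT = "iso-8859-1"


def to_utf8(raw: bytes, charset_default: str = CHARSET_DEFAULT) -> bytes:
    """Return the message body re-encoded as UTF-8.

    The charset is taken from the Content-Type header when one is
    declared; otherwise charset_default is assumed.
    """
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset() or charset_default
    body = msg.get_payload(decode=True)  # body as raw bytes
    return body.decode(charset, errors="replace").encode("utf-8")


# Same word, once as declared UTF-8 and once as undeclared ISO-8859-1.
declared = b"Content-Type: text/plain; charset=utf-8\n\n\xc3\xa4iti\n"
undeclared = b"Subject: test\n\n\xe4iti\n"

print(to_utf8(declared))
print(to_utf8(undeclared))
```

Both calls yield the same UTF-8 bytes, which is the point of normalising
everything before it reaches the word list.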

So, my bug report was pretty pointless from bogofilter's point of view.
:) I guess this bug can be closed. At least I downgraded the severity to
normal.

There remains this KMail problem, though. Maybe it's worth filing a new
report.



Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-27 Thread Samuel Thibault
Teemu Likonen, on Thu 28 Dec 2006 00:16:20 +0200, wrote:
 If the charset is not defined in the message's Content-Type header,
 bogofilter uses its own charset_default setting (default is ISO-8859-1).
 I think ISO-8859-1 is a good default: I believe most messages without
 a charset in their Content-Type header are in some kind of Western
 European charset. Most spam is probably in English.

Maybe cp1252 would be even more useful, since it is a superset of
iso-8859-1 and it is used by a lot of mailers running on another
well-known OS.

Samuel



Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-27 Thread Teemu Likonen
Samuel Thibault wrote (27.12.2006 at 23:25):

 Maybe cp1252 would be even more useful, since it is a superset of
 iso-8859-1 and it is used by a lot of mailers running on another
 well-known OS.

Indeed. Then it would be charset_default=Windows-1252 or
charset_default=cp1252.
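
A quick way to see the relationship between the two charsets is to decode
the same bytes with both. The sketch below uses Python's codec names
"cp1252" and "iso-8859-1", which are assumptions of the example and not
necessarily the names bogofilter accepts. The two agree on bytes 0xA0-0xFF
and differ only in 0x80-0x9F, where ISO-8859-1 has control codes and
Windows-1252 has printable characters.

```python
# Bytes 0xA0-0xFF decode to the same characters in both charsets.
high = bytes(range(0xA0, 0x100))
same_high_range = high.decode("cp1252") == high.decode("iso-8859-1")

# Bytes 0x80-0x9F differ: printable characters in Windows-1252,
# C1 control codes in ISO-8859-1.
sample = b"\x93spam\x94 \x80 100"
as_cp1252 = sample.decode("cp1252")
as_latin1 = sample.decode("iso-8859-1")

print(same_high_range)
print(as_cp1252)
print(repr(as_latin1))
```

Decoded as Windows-1252 the sample contains curly quotes and a euro sign;
decoded as ISO-8859-1 the same bytes become invisible control characters.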





Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-26 Thread Samuel Thibault
Clint Adams, on Sun 24 Dec 2006 09:23:44 -0500, wrote:
  Having lines
charset_default=utf-8
unicode=yes
 
 Isn't unicode=yes already the default?

Nope, iso-8859-1 is (see configure.ac).  But it could be by applying the
attached patch (yes, I had to fix the configure.ac script).

But actually, text tools should rather use the current locale's charset
(from nl_langinfo(CODESET)), instead of hardcoding it in configuration
files...
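
For illustration, the nl_langinfo(CODESET) lookup mentioned above is also
exposed by Python's locale module on Unix-like systems. This is a minimal
sketch of querying the locale's charset, not a proposal for how bogofilter
should do it; the helper name locale_charset is made up for the example.

```python
import locale


def locale_charset() -> str:
    """Return the current locale's charset as nl_langinfo(CODESET) reports it."""
    try:
        # Adopt the locale from the environment (LANG, LC_ALL, ...).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names an unsupported locale: stay in the "C" locale.
        pass
    return locale.nl_langinfo(locale.CODESET)


print(locale_charset())
```

In a UTF-8 locale such as fi_FI.UTF-8 this is expected to report UTF-8;
the exact string depends on the system's C library.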

Samuel
diff -ur bogofilter-1.1.3/configure bogofilter-1.1.3-mine/configure
--- bogofilter-1.1.3/configure  2006-12-03 05:17:15.0 +0100
+++ bogofilter-1.1.3-mine/configure 2006-12-27 01:06:32.0 +0100
@@ -6137,6 +6137,7 @@
 #define DEFAULT_CHARSET $withval
 _ACEOF
 
+   DEFAULT_CHARSET=$withval
 
 fi
 
diff -ur bogofilter-1.1.3/configure.ac bogofilter-1.1.3-mine/configure.ac
--- bogofilter-1.1.3/configure.ac   2006-12-03 04:55:30.0 +0100
+++ bogofilter-1.1.3-mine/configure.ac  2006-12-27 01:05:28.0 +0100
@@ -336,6 +336,7 @@
AC_DEFINE_UNQUOTED(DEFAULT_CHARSET, 
[$withval], 
[Use specified default charset instead of iso-8859-1])
+   [DEFAULT_CHARSET=$withval]
 )
 
 AC_SUBST(ENCODING)
Only in bogofilter-1.1.3-mine: configure-stamp
diff -ur bogofilter-1.1.3/debian/rules bogofilter-1.1.3-mine/debian/rules
--- bogofilter-1.1.3/debian/rules   2006-12-27 01:05:50.0 +0100
+++ bogofilter-1.1.3-mine/debian/rules  2006-12-27 01:09:19.0 +0100
@@ -26,11 +26,11 @@
 
$(INSTALL) -d obj-db obj-qdbm obj-sqlite
 
-	cd obj-db && CFLAGS=$(CFLAGS) ../configure --with-database=db \
+	cd obj-db && CFLAGS=$(CFLAGS) ../configure --with-database=db --with-charset=utf-8 \
 		--prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc
-	cd obj-qdbm && CPPFLAGS=-I/usr/include/qdbm CFLAGS=$(CFLAGS) ../configure --with-database=qdbm --program-suffix=-qdbm \
+	cd obj-qdbm && CPPFLAGS=-I/usr/include/qdbm CFLAGS=$(CFLAGS) ../configure --with-database=qdbm --program-suffix=-qdbm --with-charset=utf-8 \
 		--prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc
-	cd obj-sqlite && CFLAGS=$(CFLAGS) ../configure --with-database=sqlite --program-suffix=-sqlite \
+	cd obj-sqlite && CFLAGS=$(CFLAGS) ../configure --with-database=sqlite --program-suffix=-sqlite --with-charset=utf-8 \
 		--prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc && \
 	sed -i 's/^INTEGRITY_TESTS.*/INTEGRITY_TESTS=t.lock1/' src/tests/Makefile
 
Only in bogofilter-1.1.3-mine: obj-db
Only in bogofilter-1.1.3-mine: obj-qdbm
Only in bogofilter-1.1.3-mine: obj-sqlite


Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-26 Thread Clint Adams
 charset_default=utf-8
 unicode=yes
  
  Isn't unicode=yes already the default?
 
 Nope, iso-8859-1 is (see configure.ac).  But it could be by applying the
 attached patch (yes, I had to fix the configure.ac script).
 
 But actually, text tools should rather use the current locale's charset
 (from nl_langinfo(CODESET)), instead of hardcoding it in configuration
 files...

We are talking about two different things.  unicode=yes/no sets the
charset used in the database.  charset_default sets the charset assumed
for messages without proper headers.  I have seen no instances of mail
in the wild where the charset was unspecified yet was actually proper
UTF-8.





Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-26 Thread Samuel Thibault
Clint Adams, on Tue 26 Dec 2006 19:26:48 -0500, wrote:
  charset_default=utf-8
  unicode=yes
   
   Isn't unicode=yes already the default?
  
  Nope, iso-8859-1 is (see configure.ac).  But it could be by applying the
  attached patch (yes, I had to fix the configure.ac script).
  
  But actually, text tools should rather use the current locale's charset
  (from nl_langinfo(CODESET)), instead of hardcoding it in configuration
  files...
 
 We are talking about two different things.  unicode=yes/no sets the
 charset used in the database.

Ah, sorry.  Yes, unicode is the default.

 charset_default sets the charset assumed for messages without proper
 headers.  I have seen no instances of mail in the wild where the
 charset was unspecified yet was actually proper UTF-8.

Ah, ok, sorry, then the bug is probably not valid, I guess.

Samuel



Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-24 Thread Clint Adams
 Having lines
   charset_default=utf-8
   unicode=yes

Isn't unicode=yes already the default?





Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-13 Thread Teemu Likonen
Package: bogofilter
Version: 1.1.3-1
Severity: serious

I report this as serious because this _should_ be fixed before Etch is
released. This bug causes bogofilter to work incorrectly on UTF-8
systems (UTF-8 being Etch's default).

Debian Etch uses UTF-8 locales and charset by default. Bogofilter uses
ISO-8859-1 as the system default. This usually puts garbage words into
the user's ~/.bogofilter/wordlist.db, since the default charset for the
_database_ is Unicode/UTF-8.

To reproduce:


Use UTF-8 locale:

$ locale
LANG=fi_FI.UTF-8
LC_CTYPE=fi_FI.UTF-8
LC_NUMERIC=fi_FI.UTF-8
LC_TIME=fi_FI.UTF-8
LC_COLLATE=fi_FI.UTF-8
LC_MONETARY=fi_FI.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=fi_FI.UTF-8
LC_NAME=fi_FI.UTF-8
LC_ADDRESS=fi_FI.UTF-8
LC_TELEPHONE=fi_FI.UTF-8
LC_MEASUREMENT=fi_FI.UTF-8
LC_IDENTIFICATION=fi_FI.UTF-8
LC_ALL=


Use Bogofilter's default system charset (or define it to some
other 8 bit charset):
  charset_default=iso-8859-1
Use Bogofilter's default word database charset (Unicode/UTF-8):
  unicode=yes

(These can be defined in /etc/bogofilter.cf or ~/.bogofilter.cf)


Some background information: letter ä is U+00E4 LATIN SMALL LETTER
A WITH DIAERESIS and in UTF-8 encoding it takes two bytes: $c3 $a4.


$ mv ~/.bogofilter/wordlist.db ~/wordlist.db-backup
$ echo äiti | bogofilter -n
$ bogoutil -d ~/.bogofilter/wordlist.db
head:Ã¤iti 0 1 20061213

This example shows that the letter ä is encoded _twice_ with UTF-8.
The echo command prints the letter ä encoded in UTF-8, Bogofilter
thinks it is in ISO-8859-1 and encodes both bytes separately: $c3
becomes Ã (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and $a4 becomes
¤ (U+00A4 CURRENCY SIGN).
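
The double encoding can be reproduced outside bogofilter. The following
Python sketch only mimics the charset mix-up described above (UTF-8 input
misread as ISO-8859-1 and then converted to UTF-8 again); it does not call
bogofilter, and the variable names are made up for the example.

```python
word = "\u00e4iti"  # "äiti", written with an escape to keep the source ASCII

utf8_bytes = word.encode("utf-8")          # what echo emits in a UTF-8 locale
misread = utf8_bytes.decode("iso-8859-1")  # input wrongly taken as ISO-8859-1
stored = misread.encode("utf-8")           # converted to UTF-8 a second time

print(utf8_bytes)
print(misread)
print(stored)
```

The two bytes of ä turn into four bytes in the stored word, which is why
the database entry no longer matches the word in correctly decoded mail.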

With the lines
  charset_default=utf-8
  unicode=yes
in the /etc/bogofilter.cf file, characters are encoded correctly.


-- System Information:
Debian Release: 4.0
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages bogofilter depends on:
ii  bogofilter-bdb    1.1.3-1    a fast Bayesian spam filter (Berke

bogofilter recommends no packages.

-- no debconf information