Thanks to everyone for drawing our attention to this issue.
A couple of days ago the ticTOCs service moved to a new server where the data
is stored as UTF-8 (which it wasn't before). We'd forgotten to remove the UTF-8
conversion in text.php so we were serving double-encoded content (UTF-8
The string in question is double-encoded, that is, a string that's in
UTF-8 already was run through a UTF-8 encoder.
The string is Acta Ortopedica where the 'e' is really '\u00e9' aka
'Latin Small Letter E with Acute'. [1]
In UTF-8, the e-acute is two-byte encoded as C3 A9. If you run the
bytes
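
A minimal sketch of how the double encoding described above can arise. It assumes the second encoder treated each UTF-8 byte as an ISO-8859-1 character, which is a common cause but is not confirmed for text.php; the variable names are illustrative only.

```python
# U+00E9, Latin Small Letter E with Acute
e_acute = "\u00e9"

# Correct: one pass through a UTF-8 encoder gives the two bytes C3 A9.
once = e_acute.encode("utf-8")

# Assumption: the second pass reads those two bytes as ISO-8859-1
# characters (U+00C3 and U+00A9) and encodes each of them as UTF-8 again.
twice = once.decode("iso-8859-1").encode("utf-8")

print(once.hex())   # c3a9
print(twice.hex())  # c383c2a9
```

Under that assumption the single character ends up as four bytes instead of two.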
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 13:20:08 -0500
Message-ID: 719dced30912211020y7b726c83jc54d0fadcba92...@mail.gmail.com
is UTF-8...
--
Thanks to all who helped (on- and off-list),
Glen
--
From: Erik Hetzner erik.hetz...@ucop.edu
Sender: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems
At Mon, 21 Dec 2009 14:59:01 -0500,
Glen Newton wrote:
Thanks, Erik, some useful tools and advice.
Glad to help!
[…]
But I don't understand why Firefox was ignoring the
Content-Type: text/plain; charset=utf-8
It should not be using the default charset (ISO-Latin 8859-1) for
this
Sender: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 12:14:54 -0800
Message-ID: p-irc-exbe01xjmxehy1...@ex.ucop.edu
At Mon, 21 Dec 2009 14:59:01 -0500,
Glen Newton wrote
I believe they've changed it while we were having the discussion.
When I downloaded the file (with curl), it looked like this:
0020700 r t o p C etx B ) d i c a sp B r a
72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61
0020720 s i l e i r a ht
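
For illustration, the hex bytes in the dump above can be checked and repaired in a few lines. This is only a sketch: undo_double_utf8 is a hypothetical helper, and it assumes the extra pass went through ISO-8859-1, which the thread does not confirm.

```python
def undo_double_utf8(raw: bytes) -> str:
    """Reverse one extra UTF-8 encoding pass.

    Assumes the extra pass read the original UTF-8 bytes as ISO-8859-1.
    Raises UnicodeError if the input does not fit that pattern.
    """
    return raw.decode("utf-8").encode("iso-8859-1").decode("utf-8")


# The sixteen bytes from the hex line of the dump.
dumped = bytes.fromhex("72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61")

repaired = undo_double_utf8(dumped)
print(repaired)
```

If the bytes c3 83 c2 a9 collapse to a single e-acute, the file was double-encoded; a correctly encoded file would contain only c3 a9 at that position.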
On Mon, Dec 21, 2009 at 2:09 PM, Glen Newton glen.new...@nrc-cnrc.gc.ca wrote:
The file I got with wget is:
http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt
(Just to convince myself I'm not going nuts...) - this file, which
Glen downloaded with wget, appears double-encoded:
# curl -s
I agree with Godmar: it looks like (some) change happened to tictocs
between my original wget download and the one I downloaded after I
changed my browser settings.
It appears Godmar is not going nuts (or at least this issue is not due
to him going nuts!) ;-)
Viewing the file