Re: [CODE4LIB] Character problems with tictoc

2009-12-22 Thread Bucknell, Terry
Thanks to everyone for drawing our attention to this issue. A couple of days ago the ticTOCs service moved to a new server where the data is stored as UTF-8 (which it wasn't before). We'd forgotten to remove the UTF-8 conversion in text.php, so we were serving double-encoded content (UTF-8

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
The string in question is double-encoded, that is, a string that's in UTF-8 already was run through a UTF-8 encoder. The string is Acta Ortopedica where the 'e' is really '\u00e9' aka 'Latin Small Letter E with Acute'. [1] In UTF-8, the e-acute is two-byte encoded as C3 A9. If you run the bytes
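
A minimal Python sketch of the double-encoding described above. It is not from the thread: the variable names are illustrative, and it assumes the second encoding pass treated the UTF-8 bytes as ISO-8859-1 characters, which is consistent with the byte sequence c3 83 c2 a9 reported elsewhere in this thread.

```python
# Illustrative only: reproduce the double-encoding with Python's codecs.
title = "Acta Ortop\u00e9dica"  # U+00E9 is Latin Small Letter E with Acute

# Correct UTF-8: the e-acute becomes the two bytes C3 A9.
once = title.encode("utf-8")

# Assumption: the second pass read those bytes as ISO-8859-1 characters
# (C3 -> U+00C3, A9 -> U+00A9) and encoded them as UTF-8 again.
twice = once.decode("latin-1").encode("utf-8")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


print(hexdump(once))
print(hexdump(twice))
```

Under that assumption each byte of the original pair is expanded to two bytes, so the single character ends up occupying four bytes in the served file.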

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Character problems with tictoc Date: Mon, 21 Dec 2009 13:20:08 -0500 Message-ID: 719dced30912211020y7b726c83jc54d0fadcba92...@mail.gmail.com The string in question is double-encoded, that is, a string that's in UTF-8 already was run through a UTF-8

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
Subject: Re: [CODE4LIB] Character problems with tictoc Date: Mon, 21 Dec 2009 13:20:08 -0500 Message-ID: 719dced30912211020y7b726c83jc54d0fadcba92...@mail.gmail.com The string in question is double-encoded, that is, a string that's in UTF-8 already was run through a UTF-8 encoder. The string

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
is UTF-8... -- Thanks to all who helped (on- and off-list), Glen -- From: Erik Hetzner erik.hetz...@ucop.edu Sender: Code for Libraries CODE4LIB@LISTSERV.ND.EDU To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Character problems

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Erik Hetzner
At Mon, 21 Dec 2009 14:59:01 -0500, Glen Newton wrote: Thanks, Erik, some useful tools and advice. Glad to help! […] But I don't understand why Firefox was ignoring the Content-Type: text/plain; charset=utf-8 It should not be using the default charset (ISO-Latin 8859-1) for this

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
: Code for Libraries CODE4LIB@LISTSERV.ND.EDU To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Character problems with tictoc Date: Mon, 21 Dec 2009 12:14:54 -0800 Message-ID: p-irc-exbe01xjmxehy1...@ex.ucop.edu At Mon, 21 Dec 2009 14:59:01 -0500, Glen Newton wrote

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
I believe they've changed it while we were having the discussion. When I downloaded the file (with curl), it looked like this: 0020700 r t o p C etx B ) d i c a sp B r a 72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61 0020720 s i l e i r a ht
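
The bytes c3 83 c2 a9 in this dump are what one would expect if the UTF-8 for e-acute (c3 a9) had been encoded a second time. A hedged Python sketch of reversing that follows. It assumes the extra pass treated the bytes as ISO-8859-1, and `undo_double_encoding` is a hypothetical helper, not something used in the thread.

```python
def undo_double_encoding(data: bytes) -> bytes:
    """Return singly encoded UTF-8, or data unchanged if the reversal does not apply.

    Assumption: the extra pass treated UTF-8 bytes as ISO-8859-1 characters.
    """
    try:
        # Undo the outer UTF-8 layer, then map each character back to one byte.
        repaired = data.decode("utf-8").encode("latin-1")
        # The result must itself be valid UTF-8, otherwise leave the input alone.
        repaired.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return repaired


# The sixteen bytes shown in the dump above.
dumped = bytes.fromhex("72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61")
fixed = undo_double_encoding(dumped)
print(fixed.decode("utf-8"))
```

This is a heuristic: text that legitimately contains a sequence such as U+00C3 followed by U+00A9 would also be rewritten, so it is safer to fix the encoding at the source than to repair it downstream.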

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
On Mon, Dec 21, 2009 at 2:09 PM, Glen Newton glen.new...@nrc-cnrc.gc.ca wrote: The file I got with wget is:  http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt (Just to convince myself I'm not going nuts...) - this file, which Glen downloaded with wget, appears double-encoded: # curl -s

Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
I agree with Godmar: it looks like (some) change happened to tictocs between my original wget download and the one I downloaded after I changed my browser settings. It appears Godmar is not going nuts (or at least this issue is not due to him going nuts!) ;-) Viewing the file