That's probably what I'll end up doing... but it's going to be a
speed hit, I'm guessing. Shouldn't decodeURLComponent do this, or
at least have a setting to indicate how the %xx entities are encoded?
I am sure that you are right and that it is a bug...
However, I encountered a bug with this before (on Linux) and it was a
trivial exercise to write your own version which handles encodings
properly. Just use MemoryBlocks (or the in-memory BinaryStream)
copying byte values until you find a % character, and then convert
the following two bytes into a single byte.
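A minimal sketch of that byte-level approach, written in Python for illustration (in REALbasic the same walk would be done over a MemoryBlock or BinaryStream, as described above); the function name and the "+"-to-space handling are my own additions:

```python
def decode_url_component(s: str, encoding: str = "utf-8") -> str:
    """Percent-decode at the byte level, then apply a *known* encoding."""
    out = bytearray()
    i = 0
    while i < len(s):
        c = s[i]
        if c == "%" and i + 2 < len(s):
            # Convert the two hex digits after '%' into a single byte.
            out.append(int(s[i + 1:i + 3], 16))
            i += 3
        elif c == "+":
            # Form encoding maps spaces to '+'.
            out.append(0x20)
            i += 1
        else:
            out.append(ord(c))
            i += 1
    # Only now interpret the raw bytes, using the encoding we chose.
    return out.decode(encoding)
```

The point is that the percent-decoding and the character-set interpretation are two separate steps, so the caller gets to pick the "decoder wheel" instead of the system default.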
Thanks, Phil... I'll give that a try.
Just remember that a generic URLComponent string is supposed to be
encoding-free (undefined).
Well, according to what I've seen in researching it, the URLComponent
string should actually be US-ASCII, and I think REALbasic is
identifying it as such.
Even though you looked up the proper way
to encode UTF-8 text, there is no encoding tag to identify it as
such. Therefore it is up to you to identify the text as UTF-8, which
is pretty hard to do without a validator...
Right, but the encoded string ("M%FC%21") is US-ASCII, and
decodeURLComponent decodes those %xx entities according to the
MacRoman encoding (on my system, anyway). DecodeURLComponent sees the
"%FC" entity, looks it up on the MacRoman chart
(http://2shortplanks.com/unicode/charts/MacRoman.html), and says, "Say,
that's a cedilla!" even though I've said nothing in code about
anything being in MacRoman. I'm assuming it chooses MacRoman because
that's my system's default encoding... and it doesn't appear that I
have any control over that.
It seems to me that I need to tell decodeURLComponent to use the
secret decoder wheel marked "UTF-8" instead of the one marked
"MacRoman", which is what it's doing. Any attempt to define or
convert encoding after the fact seems fruitless, because the text has
already been mangled. (Unless there is a method that says "assume
this string has been decoded incorrectly from hex entities using
encoding X, and re-decode it to encoding Y".)
unless you are in a
closed-loop system where you know all data is being sent as
UTF-8. I would guess that web browsers send data in the encoding
defined by the web page,
Actually, this doesn't appear to be the case; the form I'm using has
a content-type of "text/html; charset=iso-8859-1" (which agrees with
MacRoman only in the ASCII range). Changing this line to "charset=utf-8"
has no effect on
the hex-encoding of the query-string that gets sent to the CGI; it's
sent using the UTF-8 tables regardless.
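One way to see the two encodings side by side without involving a browser, using Python's urllib as a stand-in (nothing here is REALbasic; the "Mü!" payload is an assumed example):

```python
from urllib.parse import quote

# quote() percent-encodes a string using a chosen encoding; compare the
# UTF-8 form browsers were observed to send with the MacRoman form.
print(quote("Mü!", safe=""))                        # M%C3%BC%21
print(quote("Mü!", safe="", encoding="mac_roman"))  # M%9F%21
```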
but I wouldn't be surprised if some browsers
are not UTF-8 aware.
Neither would I, but every one I've checked so far *does* send the
data encoded according to UTF-8.
***************************************************
Toby W. Rush - [EMAIL PROTECTED]
Instructor of Music Theory
PVA Webmaster & Technical Operations Manager
University of Northern Colorado
"Omnia voluntaria est."
***************************************************