That's probably what I'll end up doing... but it's going to be a
speed hit, I'm guessing. Shouldn't decodeURLComponent do this, or
at least have a setting to indicate how the %xx entities are encoded?
I am sure that you are right and that it is a bug...
However, I encountered a bug with this before (on Linux) and it was a
trivial exercise to write your own version which handles encodings
properly. Just use MemoryBlocks (or the in-memory BinaryStream)
copying byte values until you find a % character, and then convert
the following two bytes into a single byte.
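A minimal sketch of that byte-level approach, written in Python for illustration (in REALbasic the same walk would be done over a MemoryBlock or BinaryStream, as described above); the function name and the "+"-to-space handling are my own additions:

```python
def decode_url_component(s: str, encoding: str = "utf-8") -> str:
    """Percent-decode at the byte level, then apply a *known* encoding."""
    out = bytearray()
    i = 0
    while i < len(s):
        c = s[i]
        if c == "%" and i + 2 < len(s):
            # Convert the two hex digits after '%' into a single byte.
            out.append(int(s[i + 1:i + 3], 16))
            i += 3
        elif c == "+":
            # Form encoding maps spaces to '+'.
            out.append(0x20)
            i += 1
        else:
            out.append(ord(c))
            i += 1
    # Only now interpret the raw bytes, using the encoding we chose.
    return out.decode(encoding)
```

The point is that the percent-decoding and the character-set interpretation are two separate steps, so the caller gets to pick the "decoder wheel" instead of the system default.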
Thanks, Phil... I'll give that a try.
Just remember that a generic URLComponent string is supposed to be
encoding-free (undefined).
Well, according to what I've seen in researching it, the URLComponent
string should actually be US-ASCII, and I think REALbasic is
identifying it as such.
Even though you looked up the proper way
to encode UTF-8 text, there is no encoding tag to identify it as
such. Therefore it is up to you to identify the text as UTF-8, which
is pretty hard to do without a validator...
Right, but the encoded string ("M%FC%21") is US-ASCII, and
decodeURLComponent decodes those %xx entities according to the
MacRoman encoding (on my system, anyway). DecodeURLComponent sees the
"%FC" entity, looks it up on the MacRoman chart
(http://2shortplanks.com/unicode/charts/MacRoman.html), and says, "Say,
that's a cedilla!" even though I've said nothing in code about
anything being in MacRoman. I'm assuming it chooses MacRoman because
that's my system's default encoding... and it doesn't appear that I
have any control over that.
It seems to me that I need to tell decodeURLComponent to use the
secret decoder wheel marked "UTF-8" instead of the one marked
"MacRoman", which is what it's doing. Any attempt to define or
convert encoding after the fact seems fruitless, because the text has
already been mangled. (Unless there is a method that says "assume
this string has been decoded incorrectly from hex entities using
encoding X, and re-decode it to encoding Y".)
unless you are in a
closed-loop system where you know all data is being sent as
UTF-8. I would guess that web browsers send data in the encoding
defined by the web page,
Actually, this doesn't appear to be the case; the form I'm using has
a content-type of "text/html; charset=iso-8859-1" (which agrees with
MacRoman only in the ASCII range). Changing this line to "charset=utf-8"
has no effect on
the hex-encoding of the query-string that gets sent to the CGI; it's
sent using the UTF-8 tables regardless.
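One way to see the two encodings side by side without involving a browser, using Python's urllib as a stand-in (nothing here is REALbasic; the "Mü!" payload is an assumed example):

```python
from urllib.parse import quote

# quote() percent-encodes a string using a chosen encoding; compare the
# UTF-8 form browsers were observed to send with the MacRoman form.
print(quote("Mü!", safe=""))                        # M%C3%BC%21
print(quote("Mü!", safe="", encoding="mac_roman"))  # M%9F%21
```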
but I wouldn't be surprised if some browsers
are not UTF-8 aware.
Neither would I, but every one I've checked so far *does* send the
data encoded according to UTF-8.
***************************************************
Toby W. Rush - [EMAIL PROTECTED]
Instructor of Music Theory
PVA Webmaster & Technical Operations Manager
University of Northern Colorado
"Omnia voluntaria est."
***************************************************