I have a file I'm reading that has segments of potentially UTF-16, UTF-8,
etc. encoded text in various fields.  I am reading these fields and getting
the right encoding, but the UTF-encoded fields start with a BOM (0xFFFE or
0xFEFF).  That's important for UTF-16 so a reader can figure out the byte
order and recognize the data as UTF-16.  However, when I read the UTF
string in using:

var title : String = stream.readMultiByte( size, textEncoding );

the BOM is included in the title string.  I figured Adobe would strip it
off, but it's making it into the final string.  To illustrate what I mean:
if the word is "Stay" and I run the following code:

assertEquals( "Stay".length, title.length );

it fails, saying 5 is not equal to 4, even though when you look at the
string in the debugger, title shows "Stay".  And when I send that string
across the wire I can see the BOM is included in it.  So my question is: am
I reading UTF data correctly or not?  Is there a generally prescribed way
of getting rid of the BOM?
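In case it helps frame the question, here's the workaround I'm considering
in the meantime (just a sketch; readFieldStripBOM is my own name, not a
Flash API):

```actionscript
import flash.utils.IDataInput;

// Sketch of a workaround: when readMultiByte decodes a BOM correctly, it
// comes through as the single character U+FEFF at index 0 of the string,
// so it can be tested with charCodeAt and dropped after the read.
function readFieldStripBOM( stream:IDataInput, size:uint, textEncoding:String ):String
{
    var s:String = stream.readMultiByte( size, textEncoding );
    // 0xFEFF at index 0 is a correctly decoded BOM; seeing 0xFFFE there
    // would mean the decoder used the wrong byte order.
    if ( s.length > 0 && s.charCodeAt( 0 ) == 0xFEFF )
    {
        s = s.substring( 1 );
    }
    return s;
}
```

With that, assertEquals( "Stay".length, title.length ) passes, but it feels
like I'm papering over the real problem.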

I do have two text encodings I'm looking for, "utf-16" and "utf-16be", for
little-endian and big-endian UTF-16 respectively.  And it appears that the
encoding field matches utf-16/utf-16be according to the BOM.  I'm not sure
whether I should parse the BOM myself and select an encoding based on its
value, or not.
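If parsing it myself turns out to be the right answer, I'd sketch it
roughly like this, assuming I peek at the first two bytes of the field
before decoding (detectUTF16Encoding is my own name, not a Flash API):

```actionscript
// Sketch: map the two BOM bytes to the charset name I'd pass to
// readMultiByte.
function detectUTF16Encoding( b0:uint, b1:uint ):String
{
    if ( b0 == 0xFF && b1 == 0xFE ) return "utf-16";   // little-endian BOM
    if ( b0 == 0xFE && b1 == 0xFF ) return "utf-16be"; // big-endian BOM
    return "utf-16";  // no BOM: fall back to little endian (my assumption)
}
```

But that duplicates work the decoder presumably already does, which is why
I'm asking whether there's a standard approach.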

Thanks
Charlie
