For more super neato reading, see the Unicode FAQ on UTF-8, UTF-16, UTF-32 & BOM: http://www.unicode.org/faq/utf_bom.html
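A minimal Java sketch of the FAQ's main point: UTF-8, UTF-16, and UTF-32 can all represent every Unicode code point, and only the number of bytes differs. The class and method names (UtfSizes, utf8Bytes, and so on) are made up for illustration and do not come from the FAQ or from iBATIS. UTF-32 size is computed from the code point count rather than through a charset lookup, since a UTF-32 charset is not guaranteed on every Java runtime.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class UtfSizes {
    // Bytes needed to store the text as UTF-8 (1 to 4 bytes per code point).
    public static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed as UTF-16. UTF_16BE is used so no byte order mark is added.
    public static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 always uses four bytes per code point, so count code points.
    public static int utf32Bytes(String text) {
        return 4 * text.codePointCount(0, text.length());
    }

    // True if encoding to UTF-8 and decoding again gives back the same text.
    public static boolean survivesUtf8(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, Latin letter
            "\u00e7",                               // U+00E7, c with cedilla
            "\u65e5",                               // U+65E5, a CJK ideograph
            new String(Character.toChars(0x20000))  // U+20000, beyond 16 bits
        };
        for (String s : samples) {
            System.out.printf("U+%04X utf8=%d utf16=%d utf32=%d roundTrip=%b%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s),
                    utf32Bytes(s), survivesUtf8(s));
        }
    }
}
```

Every sample, including the one outside the 16-bit range, round-trips through UTF-8. The CJK sample takes three bytes in UTF-8 but two in UTF-16, which is the storage trade-off the thread below is circling around.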

On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> From the link you sent, it appears that UTF-8, UTF-16, and UTF-32 are
> all just varying ways of storing the same glyphs. The reason there are
> variations is because some systems need to process characters/glyphs
> in single byte increments, so UTF-8 is good for that. Then, some
> systems handle the double-byte UTF-16 just fine, so that's great ...
> it was the original format, anyhow. UTF-32 came about, apparently,
> because UTF-16 filled up, so they created 'surrogates' which is
> essentially two UTF-16 characters ... so, some systems can't
> understand these surrogates, so UTF-32 is a fixed representation of
> the full UCS-4 data space.
>
> Now, most folks aren't using UTF-32 because it's HUGE for data storage
> ... imagine all your databases doubling in size, effectively (if you
> were using UTF-16 already) - or quadrupling (if you used UTF-8 and you
> averaged 1.1 bytes or so for your representations). Big difference.
>
> Anyhow, that's what I got out of the informative link Brandon sent ...
> it's a good read, for sure. What I didn't get is that any difference
> in *what* can be represented exists between UTF-8, 16, and 32.
>
> Cheers!
>
> On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > I did work with a Japanese site and we used Shift_JIS which is a UTF-8
> > extension. We would store Shift_JIS into the database but then we had
> > some issues reading the stored data from the database. The characters
> > were entered as Shift_JIS and stored as UCS-2 (UTF-16) in SQL Server.
> > We tried reading them straight from the database and displaying them
> > on screen without any byte encoding conversion. But, they wound up
> > looking all wrong. The browser did not handle the conversion properly.
> > We then read the data from the database and used the Java
> > String.getBytes(String charSetName) method to reset the encoding.
> > However, the Java String.getBytes method did not work properly. We
> > wound up writing our own conversion that was quite simple and
> > everything worked. So, as far as I know, all the glyph representations
> > that are available in UTF-8 are available to UTF-16 and it is possible
> > to convert back and forth between the two so long as a glyph does not
> > exceed UTF-8 glyph storage size. But, I think UTF-16 has the potential
> > to store more complex glyphs. Maybe I'm wrong. But, that is my
> > impression with all of this.
> >
> > Brandon
> >
> > On 4/20/05, Miquel Angel Bada Zuazo <[EMAIL PROTECTED]> wrote:
> > > UTF-8 is for almost all languages (uses 8 bits for representing a
> > > letter I think), but "complicated" languages such as Japanese and Thai
> > > use 16 bits, so that's the reason for UTF-16 overall.
> > >
> > > Miquel Angel
> > >
> > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > I've done quite a bit with i18n working between UTF-8 and UTF-16. Even
> > > > after all that... I'm still mystified. :D Encoding is a world unto
> > > > itself. All I want is something that works :) Maybe one of these days
> > > > I'll understand more... for now it's all about trial and error.
> > > >
> > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > I don't see anywhere in there that UTF-8 cannot encode everything that
> > > > > UTF-16 and UTF-32 can ... just that the storage requirements differ ?!
> > > > >
> > > > > Brice
> > > > >
> > > > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > > > http://icu.sourceforge.net/docs/papers/forms_of_unicode/
> > > > > >
> > > > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > > > I had heard that Chinese does a lot with UTF-16, but I hadn't heard
> > > > > > > about Arabic ... and I don't exactly understand why UTF-8 doesn't
> > > > > > > support that ... is it simply because their character sets keep
> > > > > > > expanding and UTF-8 is static?
> > > > > > >
> > > > > > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > > > > > Latin characters are fine. However, UTF-8 is not sufficient for several
> > > > > > > > languages like Arabic and Chinese. For their FULL range of character
> > > > > > > > representations these languages require UTF-16 and in the case of
> > > > > > > > Chinese it is pushing for UTF-32.
> > > > > > > >
> > > > > > > > Brandon
> > > > > > > >
> > > > > > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > > > > > OK ... that's more reasonable. Obviously, you need to use an editor
> > > > > > > > > (such as Eclipse) that is capable of editing UTF-8 files, otherwise,
> > > > > > > > > you'll get junk and that won't be fun.
> > > > > > > > >
> > > > > > > > > Whew ... glad UTF-8 isn't compromised :)
> > > > > > > > >
> > > > > > > > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > I found this quote when doing a search in Google:
> > > > > > > > > >
> > > > > > > > > > --- quote ---
> > > > > > > > > >
> > > > > > > > > > Your actual problem is very typical. By default (without encoding
> > > > > > > > > > specified in the XML declaration), XML is encoded in UTF-8. If you use
> > > > > > > > > > an editor which is not encoding-aware and typically assuming an
> > > > > > > > > > ISO-8859-1 encoding, and you insert characters such as accented
> > > > > > > > > > letters, curly quotes, etc., you will get this error. As a workaround,
> > > > > > > > > > you can put an XML declaration with the ISO-8859-1 encoding at the top
> > > > > > > > > > of your XML file:
> > > > > > > > > >
> > > > > > > > > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > > > > > > > >
> > > > > > > > > > You can also use an editor which knows how to handle UTF-8.
> > > > > > > > > >
> > > > > > > > > > In your case it is also possible that somebody inserted incorrect
> > > > > > > > > > characters by accident, and you can just remove those and then decide
> > > > > > > > > > which encoding you want to use. UTF-8 gives you the whole range of
> > > > > > > > > > Unicode, while ISO-8859-1 gives you a limited set of characters that
> > > > > > > > > > work for the Western languages.
> > > > > > > > > >
> > > > > > > > > > --- quote ---
> > > > > > > > > >
> > > > > > > > > > maybe that will help,
> > > > > > > > > > Brandon
> > > > > > > > > >
> > > > > > > > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > What special characters aren't supported by UTF-8?! I have never heard
> > > > > > > > > > > of such a thing. My understanding is that UTF-8 represents the full
> > > > > > > > > > > Unicode character set as a multi-byte value. And since Unicode is
> > > > > > > > > > > supposed to encompass all known characters for all known languages
> > > > > > > > > > > (with space for new Chinese characters created daily) - what's not
> > > > > > > > > > > covered?!
> > > > > > > > > > >
> > > > > > > > > > > There most certainly shouldn't be anything that iso-8859-1 or latin1
> > > > > > > > > > > (Windows-1252) covers that is not in Unicode.
> > > > > > > > > > >
> > > > > > > > > > > Brice
> > > > > > > > > > >
> > > > > > > > > > > On 4/20/05, Daniel H. F. e Silva <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > You could also check your XML encoding. If you work with special
> > > > > > > > > > > > characters not in utf-8, you will get in trouble.
> > > > > > > > > > > > I had this as my native language is Portuguese and we have some
> > > > > > > > > > > > special characters not supported by utf-8.
> > > > > > > > > > > > So, if this is your case, try iso-8859-1 or one that fits better
> > > > > > > > > > > > to your needs.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Daniel Silva.
> > > > > > > > > > > >
> > > > > > > > > > > > --- Larry Meadors <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > Make sure that there is no white space and no odd chars at the
> > > > > > > > > > > > > top of your config file.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Larry
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 4/18/05, KK <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I get the following error when I try to build sqlCOnfigmap..does it
> > > > > > > > > > > > > > look familiar to someone?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > com.ibatis.sqlmap.client.SqlMapException: There was an error while building the SqlMap instance.
> > > > > > > > > > > > > > --- The error occurred in the SQL Map Configuration file.
> > > > > > > > > > > > > > --- Cause: com.ibatis.sqlmap.client.SqlMapException: XML Parser Error. Cause: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > Caused by: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > Caused by: com.ibatis.sqlmap.client.SqlMapException: XML Parser Error. Cause: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > Caused by: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > at com.ibatis.sqlmap.engine.builder.xml.XmlSqlMapClientBuilder.buildSqlMap(XmlSqlMapClientBuilder.java:203)
> > > > > > > > > > > > > > at com.ibatis.sqlmap.client.SqlMapClientBuilder.buildSqlMapClient(SqlMapClientBuilder.java:49)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Your help is greatly appreciated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > KK
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Brice Ruth
> > > > > > > > > > > Software Engineer, Madison WI
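
On the error that started the thread (java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence), the explanation quoted above is that a file saved as ISO-8859-1 is being parsed as UTF-8. A minimal Java sketch of that situation follows; the class and method names (Utf8Check, isValidUtf8) are hypothetical, and the strict decoder here is only a stand-in for whatever check the XML parser performs internally, so the failure is the same in kind but the message will differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of iBATIS or any XML parser.
class Utf8Check {
    // True if the bytes form well-formed UTF-8, false otherwise.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "acao" with c-cedilla and a-tilde, as in Portuguese.
        String text = "a\u00e7\u00e3o";

        // What an ISO-8859-1 editor writes to disk: 61 E7 E3 6F.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // What a UTF-8 aware editor writes: 61 C3 A7 C3 A3 6F.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // 0xE7 announces a 3-byte UTF-8 sequence, but 0xE3 is not a
        // continuation byte, so a strict UTF-8 reader rejects the file.
        System.out.println("latin1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("utf8 bytes valid as UTF-8:   " + isValidUtf8(utf8));

        // Read with the charset they were written in, the same bytes are fine.
        System.out.println("latin1 decoded correctly:    "
                + new String(latin1, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

The same Portuguese text is valid in either encoding; what fails is reading ISO-8859-1 bytes as if they were UTF-8. That matches the two fixes in the quoted advice: either declare the real encoding with <?xml version="1.0" encoding="ISO-8859-1"?>, or re-save the file as UTF-8.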