Re: Help needed...

Brandon Goodin Wed, 20 Apr 2005 09:44:15 -0700

For more super neato reading!

http://www.unicode.org/faq/utf_bom.html
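
The FAQ's main point, which Brice also summarizes in the quoted reply below, is that UTF-8, UTF-16, and UTF-32 all cover the same set of code points and differ only in how many bytes they spend. Here is a rough sketch that measures this with the standard Java charsets. The class and method names (EncodingSizes, utf8Bytes, and so on) are made up for illustration, and the UTF-32 size is computed as four bytes per code point rather than through a charset, since a UTF-32 charset is not among the ones every JVM is required to provide.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSizes {
    // Bytes needed to store the text in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 is fixed width: four bytes per code point, so compute it directly.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    // Encode to UTF-8, decode again, and check that nothing was lost.
    static boolean survivesUtf8(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, Latin letter
            "\u00E9",                               // U+00E9, e with acute accent
            "\u3042",                               // U+3042, Hiragana letter A
            new String(Character.toChars(0x20000))  // U+20000, CJK ideograph outside the BMP
        };
        for (String s : samples) {
            System.out.printf("U+%04X utf8=%d utf16=%d utf32=%d roundTrip=%b%n",
                s.codePointAt(0), utf8Bytes(s), utf16Bytes(s), utf32Bytes(s), survivesUtf8(s));
        }
    }
}
```

All four samples survive the UTF-8 round trip, including the one that needs a surrogate pair in UTF-16. Only the sizes change: the Latin letter takes 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32, while the Hiragana letter takes 3, 2, and 4.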


On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> From the link you sent, it appears that UTF-8, UTF-16, and UTF-32 are
> all just varying ways of storing the same glyphs. The reason there are
> variations is because some systems need to process characters/glyphs
> in single byte increments, so UTF-8 is good for that. Then, some
> systems handle the double-byte UTF-16 just fine, so that's great ...
> it was the original format, anyhow. UTF-32 came about, apparently,
> because UTF-16 filled up, so they created 'surrogates' which is
> essentially two UTF-16 characters ... so, some systems can't
> understand these surrogates, so UTF-32 is a fixed representation of
> the full UCS-4 data space.
> 
> Now, most folks aren't using UTF-32 because it's HUGE for data storage
> ... imagine all your databases doubling in size, effectively (if you were
> using UTF-16 already) - or quadrupling (if you used UTF-8 and you
> averaged 1.1 bytes or so for your representations). Big difference.
> 
> Anyhow, that's what I got out of the informative link Brandon sent ...
> it's a good read, for sure. What I didn't find in it is any difference
> in *what* can be represented between UTF-8, 16, and 32.
> 
> Cheers!
> 
> On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > I did work with a Japanese site and we used Shift_JIS which is a UTF-8
> > extension. We would store Shift_JIS into the database but then we had
> > some issues reading the stored data from the database. The characters
> > were entered as Shift_JIS and stored as UCS-2 (UTF-16) in SQL Server.
> > We tried reading them straight from the database and displaying them
> > on screen without any byte encoding conversion. But, they wound up
> > looking all wrong. The browser did not handle the conversion properly.
> > We then read the data from the database and used the Java
> > String.getBytes(String charsetName) method to reset the encoding.
> > However, the Java String.getBytes method did not work properly. We
> > wound up writing our own conversion that was quite simple and
> > everything worked. So, as far as I know, all the glyph representations
> > that are available in UTF-8 are available to UTF-16 and it is possible
> > to convert back and forth between the two so long as a glyph does not
> > exceed UTF-8 glyph storage size. But, I think UTF-16 has the potential
> > to store more complex glyphs. Maybe I'm wrong. But, that is my
> > impression with all of this.
> >
> > Brandon
> >
> > On 4/20/05, Miquel Angel Bada Zuazo <[EMAIL PROTECTED]> wrote:
> > > UTF-8 is for almost all languages (it uses 8 bits to represent a
> > > letter, I think), but "complicated" languages such as Japanese and Thai
> > > use 16 bits, so that's the reason for UTF-16 overall.
> > >
> > > Miquel Angel
> > >
> > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > I've done quite a bit with i18n working between UTF-8 and UTF-16. Even
> > > > after all that... I'm still mystified. :D Encoding is a world unto
> > > > itself. All I want is something that works :) Maybe one of these days
> > > > I'll understand more... for now it's all about trial and error.
> > > >
> > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > I don't see anywhere in there that UTF-8 cannot encode everything that
> > > > > UTF-16 and UTF-32 can ... just that the storage requirements differ ?!
> > > > >
> > > > > Brice
> > > > >
> > > > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > > > http://icu.sourceforge.net/docs/papers/forms_of_unicode/
> > > > > >
> > > > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > > > I had heard that Chinese does a lot with UTF-16, but I hadn't heard
> > > > > > > about Arabic ... and I don't exactly understand why UTF-8 doesn't
> > > > > > > support that ... is it simply because their character sets keep
> > > > > > > expanding and UTF-8 is static?
> > > > > > >
> > > > > > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > > > > > Latin characters are fine. However, UTF-8 is not sufficient for several
> > > > > > > > languages like Arabic and Chinese. For their FULL range of character
> > > > > > > > representations these languages require UTF-16 and in the case of
> > > > > > > > Chinese it is pushing for UTF-32.
> > > > > > > >
> > > > > > > > Brandon
> > > > > > > >
> > > > > > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > > > > > OK ... that's more reasonable. Obviously, you need to use an editor
> > > > > > > > > (such as Eclipse) that is capable of editing UTF-8 files, otherwise,
> > > > > > > > > you'll get junk and that won't be fun.
> > > > > > > > >
> > > > > > > > > Whew ... glad UTF-8 isn't compromised :)
> > > > > > > > >
> > > > > > > > > On 4/20/05, Brandon Goodin <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > I found this quote when doing a search in Google:
> > > > > > > > > >
> > > > > > > > > > --- quote ---
> > > > > > > > > >
> > > > > > > > > > Your actual problem is very typical. By default (without encoding
> > > > > > > > > > specified in the XML declaration), XML is encoded in UTF-8. If you use
> > > > > > > > > > an editor which is not encoding-aware and typically assuming an
> > > > > > > > > > ISO-8859-1 encoding, and you insert characters such as accented
> > > > > > > > > > letters, curly quotes, etc., you will get this error. As a workaround,
> > > > > > > > > > you can put an XML declaration with the ISO-8859-1 encoding at the top
> > > > > > > > > > of your XML file:
> > > > > > > > > >
> > > > > > > > > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > > > > > > > >
> > > > > > > > > > You can also use an editor which knows how to handle UTF-8.
> > > > > > > > > >
> > > > > > > > > > In your case it is also possible that somebody inserted incorrect
> > > > > > > > > > characters by accident, and you can just remove those and then decide
> > > > > > > > > > which encoding you want to use. UTF-8 gives you the whole range of
> > > > > > > > > > Unicode, while ISO-8859-1 gives you a limited set of characters that
> > > > > > > > > > work for the Western languages.
> > > > > > > > > >
> > > > > > > > > > --- quote ---
> > > > > > > > > >
> > > > > > > > > > maybe that will help,
> > > > > > > > > > Brandon
> > > > > > > > > >
> > > > > > > > > > On 4/20/05, Brice Ruth <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > What special characters aren't supported by UTF-8?! I have never heard
> > > > > > > > > > > of such a thing. My understanding is that UTF-8 represents the full
> > > > > > > > > > > Unicode character set as a multi-byte value. And since Unicode is
> > > > > > > > > > > supposed to encompass all known characters for all known languages
> > > > > > > > > > > (with space for new Chinese characters created daily) - what's not
> > > > > > > > > > > covered?!
> > > > > > > > > > >
> > > > > > > > > > > There most certainly shouldn't be anything that iso-8859-1 or latin1
> > > > > > > > > > > (Windows-1252) covers that is not in Unicode.
> > > > > > > > > > >
> > > > > > > > > > > Brice
> > > > > > > > > > >
> > > > > > > > > > > On 4/20/05, Daniel H. F. e Silva <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > You could also check your XML encoding. If you work with special
> > > > > > > > > > > > characters not in UTF-8, you will get in trouble.
> > > > > > > > > > > > I had this because my native language is Portuguese and we have some
> > > > > > > > > > > > special characters not supported by UTF-8.
> > > > > > > > > > > > So, if this is your case, try ISO-8859-1 or one that fits your needs
> > > > > > > > > > > > better.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > >  Daniel Silva.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --- Larry Meadors <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > Make sure that there is no white space and no odd chars at the top of
> > > > > > > > > > > > > your config file.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Larry
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 4/18/05, KK <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I get the following error when I try to build sqlCOnfigmap... does it
> > > > > > > > > > > > > > look familiar to someone?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > com.ibatis.sqlmap.client.SqlMapException: There was an error while building the SqlMap instance.
> > > > > > > > > > > > > > --- The error occurred in the SQL Map Configuration file.
> > > > > > > > > > > > > > --- Cause: com.ibatis.sqlmap.client.SqlMapException: XML Parser Error.
> > > > > > > > > > > > > > Cause: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > Caused by: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > Caused by: com.ibatis.sqlmap.client.SqlMapException: XML Parser Error.
> > > > > > > > > > > > > > Cause: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > Caused by: java.io.UTFDataFormatException: Invalid byte 3 of 3-byte UTF-8 sequence.
> > > > > > > > > > > > > > at com.ibatis.sqlmap.engine.builder.xml.XmlSqlMapClientBuilder.buildSqlMap(XmlSqlMapClientBuilder.java:203)
> > > > > > > > > > > > > > at com.ibatis.sqlmap.client.SqlMapClientBuilder.buildSqlMapClient(SqlMapClientBuilder.java:49)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Your help is greatly appreciated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > KK
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Brice Ruth
> > > > > > > > > > > Software Engineer, Madison WI
>
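
For the original error at the bottom of the thread ("Invalid byte 3 of 3-byte UTF-8 sequence"), the quoted explanation is that the file was saved in something like ISO-8859-1 while the parser read it as UTF-8. Here is a minimal sketch of that mismatch using a strict decoder from java.nio. The class and method names (StrictUtf8Check, isValidUtf8) are made up for illustration, and this is not how iBATIS or the XML parser performs its own check; it only shows why Latin-1 bytes for accented letters are rejected as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictUtf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Raised when a byte sequence cannot be UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        // Portuguese word with c-cedilla and a-tilde, written with escapes
        // so the source file's own encoding does not matter.
        String text = "a\u00E7\u00E3o";

        // What an editor assuming ISO-8859-1 would write to disk: 61 E7 E3 6F.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // What a UTF-8 aware editor would write: 61 C3 A7 C3 A3 6F.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Latin-1 bytes valid as UTF-8? " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8?   " + isValidUtf8(utf8));
    }
}
```

In the Latin-1 bytes, 0xE7 looks like the start of a 3-byte UTF-8 sequence, but the bytes after it are not continuation bytes, so strict decoding fails. Either saving the file as UTF-8 or declaring encoding="ISO-8859-1" in the XML declaration makes the bytes and the declared encoding agree.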
