On Fri, Aug 14, 2009 at 12:54 PM, Frederik Ramm <[email protected]> wrote:
> Hi,
>
> Frederik Ramm wrote:
>> The result file should have been something like 400 bytes. This sounds
>> trivial but in the original case where the .osc contained a large number
>> of these characters, I suddenly had 2 MB of data in one tag.
>
> I forgot to mention: I'm posting this here on dev and not on the osmosis
> list because it seems that other (at least Java) programs are also
> affected; someone fixed the node later with a commit comment of "JOSM
> says string too long" or so...

The code points for these Gothic characters are fine. See the following
(awesome) site:

http://decodeunicode.org/en/gothic

A rough transliteration is HEJSPANOA. However, they lie outside the Basic
Multilingual Plane (BMP) and can't be represented by a single 16-bit
integer. Java stores characters internally as 16-bit code units (originally
UCS-2), so any code that treats one such unit as a whole character goes
horribly wrong.

IANAJavaProgrammer, but there's lots of very relevant stuff on
http://en.wikipedia.org/wiki/UTF-16 with the following choice quotes:

"UCS-2 (2-byte Universal Character Set) is an obsolete character encoding
which is a predecessor to UTF-16. The UCS-2 encoding form is identical to
that of UTF-16, except that it does not support surrogate pairs and
therefore can only encode characters in the BMP range U+0000 through
U+FFFF."

NB: U+10337 is outside that range.

"Java used UCS-2 initially, and added UTF-16 supplementary character
support in J2SE 5.0. Note that several widely-used String methods can still
create and return unpaired surrogates; e.g. any code written assuming that
substring is always safe, or that charAt returns a unicode character, may
give rise to bugs[4][5]."

I'm guessing more unit tests need writing ;-)

Cheers,
Andy

In the meantime of course, let's BAN JOSM!!!!!!11!11 :-)
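
To make the surrogate-pair point concrete, here is a minimal Java sketch
(Java 5 or later). It only illustrates the general pitfall quoted above and
is not taken from osmosis or JOSM; the class name SurrogateDemo and the
helper safeTruncate are hypothetical names made up for this example.

```java
public class SurrogateDemo {
    // U+10337 (one of the Gothic letters discussed above) lies outside the
    // BMP, so Java stores it as two 16-bit chars: a surrogate pair.
    public static final String GOTHIC = new String(Character.toChars(0x10337));

    /** Number of 16-bit code units; this is what String.length() reports. */
    public static int codeUnits(String s) {
        return s.length();
    }

    /** Number of actual Unicode code points in the string. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /**
     * Hypothetical helper: truncate to at most maxUnits chars without
     * cutting a surrogate pair in half (plain substring() would).
     */
    public static String safeTruncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        int end = maxUnits;
        // If the cut falls between a high and a low surrogate, back up by one.
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a" + GOTHIC;
        System.out.println(codeUnits(s));                           // 3
        System.out.println(codePoints(s));                          // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(safeTruncate(s, 2).length());            // 1
    }
}
```

The thing to notice is that length() and charAt() count code units, not
characters, so a naive s.substring(0, 2) here would leave a lone high
surrogate at the end of the string.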

