...? problem yesterday

Brett Henderson Sun, 16 Aug 2009 00:59:13 -0700

Andy Allan wrote:

On Fri, Aug 14, 2009 at 3:16 PM, Andy Allan<[email protected]> wrote:

On Fri, Aug 14, 2009 at 12:54 PM, Frederik Ramm<[email protected]> wrote:

Hi,


Frederik Ramm wrote:

The result file should have been something like 400 bytes. This sounds
trivial but in the original case where the .osc contained a large number
of these characters, I suddenly had 2 MB of data in one tag.

I forgot to mention: I'm posting this here on dev and not on the osmosis
list because it seems that other (at least Java) programs are also
affected; someone fixed the node later with a commit comment of "JOSM
says string too long" or so...

The code points for these gothic characters are fine. See the
following (awesome) site:

http://decodeunicode.org/en/gothic

A rough transliteration is HEJSPANOA. However, they lie outside the
Basic Multilingual Plane (BMP) and can't be represented by a 16-bit
integer. Java stores characters internally as 16-bit UTF-16 code units,
so each of these letters takes a surrogate pair, and everything is
going horribly wrong.
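
To make the 16-bit point concrete, here is a minimal sketch (not taken from Osmosis or JOSM; the class and method names are made up for illustration) showing how a string of U+10330 characters looks from Java's side. Each character occupies two char values, so any code that counts or copies by char rather than by code point is handling halves of characters.

```java
public class SurrogateDemo {
    /** Builds a string holding `count` copies of the given code point. */
    static String repeat(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = repeat(0x10330, 3);

        // Number of 16-bit char values: two per supplementary character.
        System.out.println(s.length());                        // 6
        // Number of actual Unicode code points.
        System.out.println(s.codePointCount(0, s.length()));   // 3
        // The first char is only the leading half of a surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```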


Installing an SMP-aware font shows what JOSM is doing more easily than
reading Unicode code-points.

http://code2000.net/code2001.htm

I'll keep my (horrid) transliterations going here for the sake of everyone else.

v31 - HEJSPANOA
v32 - HHEHEJHEJSHEJSPHEJSPAHEJSPANHEJSPANOHEJSPANOA

i.e. the first letter, the first two letters, the first three letters etc.

I can see how you can quickly end up with a 2MB tag using this encoding scheme!

Cheers,
Andy

Thanks for all this. These unicode problems are the bane of my existence :-) Any help is much appreciated.

I've run some experiments. I've been using the unicode character 0x10330 and experimenting with creating a test file then copying it via osmosis. It seems that if I create a tag containing a single instance of that character I can copy it okay. But when I create a tag with multiple 0x10330 characters it starts to get duplicated.


If I create a tag with a single 0x10330 character it gets copied correctly.

If I surround the character with normal latin characters it copies correctly.

If I put 2 0x10330 characters in the tag, 3 get written to the output.
If I put 3 0x10330 characters in the tag, 6 get written to the output.
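
These counts (2 in, 3 out; 3 in, 6 out) fit the prefix pattern in Andy's transliteration: n characters in, 1 + 2 + ... + n = n(n+1)/2 characters out. The sketch below only reproduces that observed pattern so the numbers can be checked. It is not a claim about what the parser does internally, and the class and method names are hypothetical.

```java
public class PrefixGrowth {
    /** Concatenates every prefix of the value: first 1 code point, then 2, and so on. */
    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int n = value.codePointCount(0, value.length());
        for (int i = 1; i <= n; i++) {
            // Index of the char just past the i-th code point.
            int end = value.offsetByCodePoints(0, i);
            out.append(value, 0, end);
        }
        return out.toString();
    }

    /** Number of code points written for n code points read: n(n+1)/2. */
    static long expandedCount(long n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(expandedCount(2)); // 3
        System.out.println(expandedCount(3)); // 6
        System.out.println(expandedCount(9)); // 45
        System.out.println(expand("HEJSPANOA"));
    }
}
```

The growth is quadratic, which is consistent with a tag value ballooning very quickly once it contains many of these characters.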

If I make each 0x10330 character non-consecutive by surrounding them with latin characters they still get duplicated.

I've run this under a debugger and it seems that the data gets duplicated during input, not output. My ElementWriter class may have some issues with surrogate pairs, but it appears that it isn't the source of this problem.

I've opened the file directly using a UTF-8 input stream under a debugger and the characters are read in correctly there as well.
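
For anyone wanting to repeat that check without a debugger, something along these lines should do. It is a rough sketch only: the file contents are a made-up minimal fragment rather than a valid OSM file, and the class and method names are hypothetical. It writes a tag containing several U+10330 characters as UTF-8, reads the file back through a plain UTF-8 reader with no XML parser involved, and counts how many come out.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ReadCheck {
    /** Writes a temp file with one tag whose value holds `count` copies of U+10330. */
    static Path writeSample(int count) {
        StringBuilder value = new StringBuilder();
        for (int i = 0; i < count; i++) {
            value.appendCodePoint(0x10330);
        }
        String xml = "<osm><node id=\"1\"><tag k=\"name\" v=\"" + value + "\"/></node></osm>";
        try {
            Path file = Files.createTempFile("gothic", ".osm");
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file as UTF-8 text and counts occurrences of the target code point. */
    static long countCodePoint(Path file, int target) {
        // Collect everything first so a surrogate pair is never split across buffers.
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.codePoints().filter(cp -> cp == target).count();
    }

    public static void main(String[] args) {
        Path file = writeSample(3);
        System.out.println(countCodePoint(file, 0x10330));
    }
}
```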

I've tried using the osmosis --fast-read-xml-0.6 task and the problem goes away. This alternative XML reading task uses the Woodstox StAX XML parser.

So to summarise it seems like the standard Java XML parser (based on Apache Xerces I believe) is somehow introducing surrogate pair duplication when multiple surrogate pairs are involved. I don't know if this is a bug or a problem in how we're using it. I'm always hesitant to assume bugs in the Java runtime but it seems like there might be one here.
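
A small harness like the following might help narrow this down outside of osmosis. It is only a sketch under assumptions: the names and the sample document are hypothetical, and it uses the standard javax.xml APIs, so it exercises whichever SAX and StAX implementations the runtime picks up. Woodstox would only be used for the StAX side if its jar is on the classpath. It parses the same UTF-8 bytes both ways and reports how many code points each parser delivers in the tag value, so a result larger than the input count would point at the parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ParserComparison {
    /** Hypothetical sample: one tag whose value holds `count` copies of U+10330. */
    static byte[] sampleXml(int count) {
        StringBuilder v = new StringBuilder();
        for (int i = 0; i < count; i++) {
            v.appendCodePoint(0x10330);
        }
        String xml = "<osm><node id=\"1\"><tag k=\"name\" v=\"" + v + "\"/></node></osm>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Code points in the tag value as seen through the default SAX parser. */
    static int viaSax(byte[] xml) {
        final int[] result = {-1};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("tag".equals(qName)) {
                            result[0] = codePoints(atts.getValue("v"));
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    /** Code points in the tag value as seen through the default StAX parser. */
    static int viaStax(byte[] xml) {
        int result = -1;
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "tag".equals(r.getLocalName())) {
                    result = codePoints(r.getAttributeValue(null, "v"));
                }
            }
            r.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] xml = sampleXml(3);
        System.out.println("SAX:  " + viaSax(xml));
        System.out.println("StAX: " + viaStax(xml));
    }
}
```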


Options:

1. Try to find the source of the problem in the Java XML parser or our use of it.
2. Switch over to the Woodstox StAX XML parser which isn't exhibiting the problem.

Given that Woodstox StAX parsing gives an approx 20% performance improvement, it might be a good time to implement option 2.


Brett


Re: [OSM-dev] strange Osmosis/XML/...? problem yesterday
