Brett Henderson wrote:
Thanks for all this. These Unicode problems are the bane of my existence :-) Any help is much appreciated.

I've run some experiments using the Unicode code point U+10330 (0x10330): I create a test file and then copy it via osmosis. If I create a tag containing a single instance of that character, it copies okay. But when I create a tag with multiple U+10330 characters, the character starts to get duplicated in the copy.
If anybody wishes to repeat my tests, I used the following code snippet in Java.

       // Build a tag value containing U+10330 three times, each wrapped in "x".
       int unicodeInput;
       StringBuilder builder;

       unicodeInput = 0x10330;
       builder = new StringBuilder();
       //builder.append("prefix");
       for (int i = 0; i < 3; i++) {
           builder.append("x");
           builder.appendCodePoint(unicodeInput);
           builder.append("x");
       }
       //builder.append("suffix");

       // Write a single node carrying that tag to bh-test.osm.
       XmlWriter xmlWriter = new XmlWriter(new File("bh-test.osm"), CompressionMethod.None);
       Node node;
       node = new Node(1, 2, new Date(), OsmUser.NONE, 3, 4, 5);
       node.getTags().add(new Tag("test", builder.toString()));
       xmlWriter.process(new NodeContainer(node));
       xmlWriter.complete();
       xmlWriter.release();

       // Read the file back as UTF-8 and print it line by line.
       FileInputStream iStream = new FileInputStream("bh-test.osm");
       InputStreamReader reader = new InputStreamReader(iStream, "UTF-8");
       BufferedReader bufferedReader = new BufferedReader(reader);
       for (String line = bufferedReader.readLine(); line != null; line = bufferedReader.readLine()) {
           System.out.println(line);
       }


It first creates a string containing a Unicode character that requires a surrogate pair when represented in UTF-16. It then creates an XmlWriter (the Osmosis class implementing the --write-xml/--wx task) and writes out a very basic OSM file called bh-test.osm containing a single node with one tag whose value is the previously created string. Finally it reads the file back and prints it to stdout.
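
For anyone unfamiliar with surrogate pairs, here is a minimal standalone sketch (not part of Osmosis; the class and method names are made up for illustration) showing how U+10330 is represented in UTF-16 and UTF-8 using only the standard Java library:

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Returns the UTF-16 code units of one code point as space-separated hex.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Returns the UTF-8 bytes of one code point as space-separated hex.
    static String utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new StringBuilder().appendCodePoint(0x10330).toString();
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (actual character)
        System.out.println(utf16Units(0x10330));             // D800 DF30
        System.out.println(utf8Bytes(0x10330));              // F0 90 8C B0
    }
}
```

So a correctly written bh-test.osm should contain the four-byte sequence F0 90 8C B0 for each occurrence of the character.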

The output on stdout will probably show '?' characters (at least on Windows), but under a debugger you can verify that the characters are being read in correctly as UTF-16.
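
If a debugger isn't handy, a rough alternative is to print each code point in hex instead of relying on the console encoding. This is only a sketch with a hypothetical helper name (describe); it could be called on each line read back from the file:

```java
class CodePointDump {
    // Renders every code point of the string as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same shape as one iteration of the test string: "x", U+10330, "x".
        String value = "x" + new String(Character.toChars(0x10330)) + "x";
        System.out.println(describe(value)); // U+0078 U+10330 U+0078
    }
}
```

A surrogate pair that was decoded correctly shows up as a single U+10330 entry; a broken one would appear as separate U+D800 / U+DF30 entries or as U+FFFD.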

I then ran the file through the normal osmosis app to copy it into a new file. This triggered the bug.
osmosis --rx bh-test.osm --wx bh-test-out.osm

I was able to avoid the bug by using the Woodstox StAX XML parser.
osmosis --fast-read-xml-0.6 bh-test.osm --wx bh-test-out.osm
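
To compare the two runs without eyeballing the files, something like the following sketch could count how often U+10330 occurs in the input and output files. The class and method names are hypothetical, and it assumes both files are UTF-8 and small enough to read into memory:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class DuplicateCheck {
    // Counts how many times the given code point occurs in the text.
    static long countCodePoint(String text, int codePoint) {
        return text.codePoints().filter(cp -> cp == codePoint).count();
    }

    // Reads a whole file as UTF-8 and counts occurrences of the code point.
    static long countInFile(Path file, int codePoint) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return countCodePoint(text, codePoint);
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the osmosis commands above.
        for (String name : new String[] {"bh-test.osm", "bh-test-out.osm"}) {
            Path file = Paths.get(name);
            if (Files.exists(file)) {
                System.out.println(name + ": " + countInFile(file, 0x10330));
            } else {
                System.out.println(name + ": not found");
            }
        }
    }
}
```

With the test data above, bh-test.osm should report 3; a higher count for bh-test-out.osm would indicate that the copy duplicated characters.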

