Interestingly, the data is actually valid UTF-8 (the offending characters are ASCII control characters). The problem is that XML defines a subset of Unicode characters that excludes these and a few other ranges.
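To make the distinction concrete, here is a minimal sketch in Python (not the Rails code, and not the eventual fix) of a check against the Char production of XML 1.0. The function names and the \x01 in the sample value are made up for illustration; the actual control character in the offending tag isn't identified in this thread.

```python
import re

# Complement of the XML 1.0 Char production:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matched by this pattern is NOT allowed in an XML 1.0 document.
_INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def invalid_xml_chars(value):
    """Return the characters in value that fall outside XML 1.0 Char (hypothetical helper)."""
    return _INVALID_XML_CHAR.findall(value)


def is_xml_safe(value):
    """True if every character of value is permitted by XML 1.0 Char (hypothetical helper)."""
    return _INVALID_XML_CHAR.search(value) is None


if __name__ == "__main__":
    # Assumption: \x01 stands in for the unknown control character in the tag.
    name = "\x01Meycauayan City Northbound Entry Point"
    name.encode("utf-8")             # encodes without error: it is valid UTF-8
    print(is_xml_safe(name))         # False: not a valid XML character
    print(invalid_xml_chars(name))   # ['\x01']
```

A check like this only detects the problem; whether the API should reject such tag values or strip the characters is a separate decision.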
http://www.w3.org/TR/REC-xml/#NT-Char

None of the Rails code is explicitly aware of the difference between UTF-8 and this XML-UTF8-subset. All the XML parsing is done by libxml2* (so we haven't come across this distinction before), but this was entered via Potlatch and so wasn't parsed by an XML parser.

Arguably the Rails code does the right thing, because during API 0.6 we decided that "all UTF-8" would be valid in OSM tags (and that there wouldn't be any normalization between e.g. e-acute and e + combining acute, etc.), but maybe we should tweak that definition to say that only "all XML-UTF8-subset characters", as defined in the above link, are permitted.

Test cases and code fixes to follow. This was all figured out by Matt.

Cheers,
Andy

* hopefully, but that's not been audited

On Tue, Jul 14, 2009 at 9:42 AM, Jon Burgess<[email protected]> wrote:
> I noticed that the diff parsing on the tile server stopped this morning.
> This changeset seems to be the cause:
>
> $ gzip -dc 200907140650-200907140651.osc.gz | xmllint -noout -
> -:36: parser error : invalid character in attribute value
> <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
> ^
> -:36: parser error : attributes construct error
> <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
> ^
>
> http://www.openstreetmap.org/browse/node/410383150
> http://www.openstreetmap.org/browse/node/441527354
>
>
> Jon
>
>
>
> _______________________________________________
> dev mailing list
> [email protected]
> http://lists.openstreetmap.org/listinfo/dev

