Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Martin J. Dürst
There is charlint (http://www.w3.org/International/charlint/), which is based on UTF-8. It may be possible to adapt it to UTF-16/32. Regards, Martin. On 2010/11/04 4:37, Jim Monty wrote: Is there a utility, preferably open source and written in C, that inspects UTF-16/UTF-16BE/UTF-16LE text

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Bjoern Hoehrmann
* Jim Monty wrote: Unfortunately, I'm not a good enough programmer to write such a utility in C or even Perl, the language I know best. Is this a project that interests you, by chance? I'm surprised I'm having difficulty finding an existing utility to repair broken UTF-16 text. I thought

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Doug Ewell
Jim Monty jim dot monty at yahoo dot com wrote: I'm surprised I'm having difficulty finding an existing utility to repair broken UTF-16 text. I thought this was something many programmers would need, especially Web developers. It may be that broken UTF-16 text doesn't appear that often in

inquiry about collation testing

2010-11-04 Thread Ngwe Tun
Dear List, are there any tools, programs, or databases that can test CLDR collation? Best, Ngwe Tun.

Re: inquiry about collation testing

2010-11-04 Thread Mark Davis ☕
For the 6.0 version of UCA, there is a test file for CLDR collation in http://www.unicode.org/Public/UCA/6.0.0/. See http://www.unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html. This does not, however, test the customized rules for a given language. If you are a programmer, a 'monkey-test'

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Markus Scherer
On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell d...@ewellic.org wrote: It may be that broken UTF-16 text doesn't appear that often in the real world. Certainly it's a test case that should be detected and handled (and I always do so when rolling my own transcoders), but perhaps not many people

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Jim Monty
Markus Scherer wrote: Doug Ewell wrote: It may be that broken UTF-16 text doesn't appear that often in the real world. 16-bit Unicode is convenient in that when you find an unpaired surrogate (that is, it's not well-formed UTF-16) you can usually just treat it like a surrogate code point

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Markus Scherer
On Thu, Nov 4, 2010 at 2:52 PM, Jim Monty jim.mo...@yahoo.com wrote: In other words, when you process 16-bit Unicode text it takes no effort to handle unpaired surrogates, other than making sure that you only assemble a supplementary code point when a lead surrogate is really followed by
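The lenient decoding described in this message can be sketched roughly as follows. This is an illustration only, not code from the thread: the function name `decode_utf16_units` is hypothetical, and the input is assumed to be a list of 16-bit code units that have already been read in the correct byte order.

```python
def decode_utf16_units(units):
    """Leniently decode 16-bit code units into code points (hypothetical helper).

    A supplementary code point is assembled only when a lead surrogate is
    immediately followed by a trail surrogate. Any other surrogate is passed
    through unchanged as a surrogate code point.
    """
    code_points = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            trail = units[i + 1]
            # Combine the 10 payload bits of each surrogate.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate kept as-is.
            code_points.append(unit)
            i += 1
    return code_points
```

For example, the units `0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042` would decode to the code points U+0041, U+1F600, U+D800, U+0042, with the lone lead surrogate carried through rather than rejected.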

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Jim Monty
Thank you, Markus, for your clear, authoritative explanation and for talking me down from the ledge. Björn Höhrmann kindly suggested using 'uconv' that comes with ICU. That's what I'll use to repair the corrupted UTF-16 text I have in hand. My true object is to demonstrate to a software maker 

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Doug Ewell
Markus Scherer wrote: While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point. However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not
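The quoted message is cut off in this listing, so the following is only one way to illustrate the encoding ambiguity it mentions. The sketch assumes a naive encoder (the hypothetical `encode_utf16_units`) that writes surrogate code points out as single code units. Under that assumption, two adjacent surrogate code points produce the same code units as one supplementary code point.

```python
def encode_utf16_units(code_points):
    """Naively encode code points as 16-bit code units (hypothetical helper).

    Surrogate code points (U+D800..U+DFFF) are written out as single units,
    which is exactly what makes the result ambiguous.
    """
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # lead surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # trail surrogate
        else:
            units.append(cp)
    return units


# Two surrogate code points versus one supplementary code point.
as_two_surrogates = encode_utf16_units([0xD800, 0xDC00])
as_one_supplementary = encode_utf16_units([0x10000])
```

Both calls yield the units `0xD800, 0xDC00`, so a decoder reading them back cannot tell which sequence of code points was intended.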

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Markus Scherer
On Thu, Nov 4, 2010 at 5:46 PM, Doug Ewell d...@ewellic.org wrote: I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Various degrees
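The stricter option mentioned here, reporting unpaired surrogates and converting them to U+FFFD, might look roughly like the sketch below. It is not a utility from the thread: the name `repair_utf16` and its `byteorder` parameter are hypothetical, the input is assumed to have no byte order mark, and a dangling odd byte at the end of the input is simply dropped.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def repair_utf16(data, byteorder="little"):
    """Replace unpaired surrogates with U+FFFD (hypothetical helper).

    Returns (repaired_bytes, bad_indexes), where bad_indexes lists the
    code-unit positions that held an unpaired surrogate.
    Assumes no BOM; a trailing odd byte is ignored.
    """
    prefix = "<" if byteorder == "little" else ">"
    count = len(data) // 2
    units = list(struct.unpack(f"{prefix}{count}H", data[:count * 2]))
    bad_indexes = []
    i = 0
    while i < count:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            if i + 1 < count and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad_indexes.append(i)  # lead surrogate with no trail
            units[i] = REPLACEMENT
        elif 0xDC00 <= unit <= 0xDFFF:
            bad_indexes.append(i)  # trail surrogate with no lead
            units[i] = REPLACEMENT
        i += 1
    return struct.pack(f"{prefix}{count}H", *units), bad_indexes
```

The returned index list serves as the report, and the returned bytes should then decode cleanly with a strict UTF-16 decoder of the matching byte order.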