There is charlint (http://www.w3.org/International/charlint/), which is
based on UTF-8. It may be possible to adapt it to UTF-16/32.
Regards, Martin.
On 2010/11/04 4:37, Jim Monty wrote:
Is there a utility, preferably open source and written in C, that inspects
UTF-16/UTF-16BE/UTF-16LE text
* Jim Monty wrote:
Unfortunately, I'm not a good enough programmer to write such a utility in C
or even Perl, the language I know best. Is this a project that interests you,
by chance?
I'm surprised I'm having difficulty finding an existing utility to repair
broken UTF-16 text. I thought
Jim Monty jim dot monty at yahoo dot com wrote:
I'm surprised I'm having difficulty finding an existing utility
to repair broken UTF-16 text. I thought this was something many
programmers would need, especially Web developers.
It may be that broken UTF-16 text doesn't appear that often in
Dear List,
Are there any tools, programs, or databases that can test CLDR collation?
Best
Ngwe Tun.
For the 6.0 version of UCA, there is a test file for CLDR collation in
http://www.unicode.org/Public/UCA/6.0.0/. See
http://www.unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html. This does
not, however, test the customized rules for a given language. If you are a
programmer, a 'monkey-test'
On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell d...@ewellic.org wrote:
It may be that broken UTF-16 text doesn't appear that often in the real
world. Certainly it's a test case that should be detected and handled
(and I always do so when rolling my own transcoders), but perhaps not
many people
Markus Scherer wrote:
Doug Ewell wrote:
It may be that broken UTF-16 text doesn't appear that often in the
real world.
16-bit Unicode is convenient in that when you find an unpaired surrogate
(that is, it's not well-formed UTF-16) you can usually just treat it like
a surrogate code point
On Thu, Nov 4, 2010 at 2:52 PM, Jim Monty jim.mo...@yahoo.com wrote:
In other words, when you process 16-bit Unicode text it takes no effort to
handle unpaired surrogates, other than making sure that you only assemble a
supplementary code point when a lead surrogate is really followed by
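
For readers following the thread, the lenient approach described above might be sketched roughly as follows. This is only an illustration and not code from any poster or from ICU: the function name decode_utf16_units and the choice to represent text as a list of 16-bit integers are assumptions made for the example. A supplementary code point is assembled only when a lead surrogate is immediately followed by a trail surrogate; any other surrogate is passed through as its own code point.

```python
def decode_utf16_units(units):
    """Decode 16-bit code units into code points, leniently.

    Hypothetical helper for illustration: unpaired surrogates are
    passed through unchanged as surrogate code points.
    """
    code_points = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            code_points.append(cp)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: keep it as-is.
            code_points.append(unit)
            i += 1
    return code_points


if __name__ == "__main__":
    sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
    print([hex(cp) for cp in decode_utf16_units(sample)])
```

Note that this sketch never raises an error, so it suits code that must keep processing whatever it is handed; it does not by itself make the text well-formed.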
Thank you, Markus, for your clear, authoritative explanation and for talking me
down from the ledge.
Björn Höhrmann kindly suggested using 'uconv' that comes with ICU. That's what
I'll use to repair the corrupted UTF-16 text I have in hand.
My true object is to demonstrate to a software maker
Markus Scherer wrote:
While processing 16-bit Unicode text which is not assumed to be
well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
mostly-inert surrogate code point. However, you cannot unambiguously
encode a surrogate code point in 16-bit text (because you could not
On Thu, Nov 4, 2010 at 5:46 PM, Doug Ewell d...@ewellic.org wrote:
I'm probably missing something here, but I don't agree that it's OK for a
consumer of UTF-16 to accept an unpaired surrogate without throwing an
error, or converting it to U+FFFD, or otherwise raising a fuss.
Various degrees
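
The stricter alternative mentioned above, converting each unpaired surrogate to U+FFFD, could look something like the sketch below. It is a minimal illustration under stated assumptions, not the behaviour of uconv or of any particular library: the name repair_utf16_units and the list-of-integers input are invented for the example.

```python
REPLACEMENT_CHARACTER = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def repair_utf16_units(units):
    """Return a copy of the code units with unpaired surrogates replaced.

    Hypothetical helper for illustration: well-formed surrogate pairs
    are kept, every unpaired surrogate becomes U+FFFD.
    """
    repaired = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                # Lead followed by trail: keep the pair intact.
                repaired.extend((unit, units[i + 1]))
                i += 2
                continue
            # Lead surrogate with no trail after it.
            repaired.append(REPLACEMENT_CHARACTER)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Trail surrogate with no lead before it.
            repaired.append(REPLACEMENT_CHARACTER)
        else:
            repaired.append(unit)
        i += 1
    return repaired


if __name__ == "__main__":
    broken = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
    print([hex(u) for u in repair_utf16_units(broken)])
```

A consumer that prefers to "raise a fuss" could raise an exception at the two points where this sketch appends U+FFFD instead.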