On 2014-04-16 12.51, Kevin Bracey wrote:
> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>> On 15.04.14 21:10, Peter Krefting wrote:
>>> Torsten Bögershausen:
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>> Is there a script that generates this code from the Unicode database files,
>>> or did you hand-update it?
>> Some of the code points which have "0 length on the display" are called
>> "combining", others are called "vowels" or "accents".
>> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
>> be combining (please correct me if that is wrong).
> Indeed it is combining (more specifically it has General Category
> "Nonspacing_Mark" = "Mn").
>> If I could have found a file which indicates for each code point, what it
>> is, I could write a script.
> The most complete and machine-readable data are in these files:
> The general categories can also be seen more legibly in:
> For docs, see:
> The existing utf8.c comments describe the attributes being selected from the
> tables (general categories "Cf","Mn","Me", East Asian Width "W", "F"). And
> they suggest that the combining character table was originally auto-generated
> from UnicodeData.txt with a "uniset" tool. Presumably this?
> The fullwidth-checking code looks like it was done by hand, although
> apparently uniset can process EastAsianWidth.txt.
Excellent, thanks for the pointers.
Running the script below shows that U+00AD SOFT HYPHEN (and some other
code points) should have zero length on the display.
I wonder whether that is really the case, and which of the last two lines
in the script is the right one.
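As a quick cross-check that does not depend on uniset, the general category of a code point can be read straight from UnicodeData.txt, where the third semicolon-separated field is the General Category. This is only a minimal sketch, assuming UnicodeData.txt sits in the current directory; the helper names list_category and category_of are made up for illustration.

```shell
# List "CODEPOINT NAME" for every entry whose General Category (field 3)
# equals the requested one, e.g. Cf, Mn or Me.
# Usage: list_category CATEGORY FILE
list_category () {
	awk -F';' -v cat="$1" '$3 == cat { print $1, $2 }' "$2"
}

# Print the General Category of one code point (hex digits, no "U+" prefix).
# Usage: category_of CODEPOINT FILE
category_of () {
	awk -F';' -v cp="$1" 'toupper($1) == toupper(cp) { print $3 }' "$2"
}
```

With the real data file, `category_of 00AD UnicodeData.txt` should print Cf and `category_of 05BF UnicodeData.txt` should print Mn, and `list_category Cf UnicodeData.txt` shows which other code points the "Cf" selector would pull in.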
What does this category description mean for us?
"Cf Format a format control character"
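if ! test -f UnicodeData.txt; then
	: # download command not preserved in this copy of the message
fi &&
if ! test -f EastAsianWidth.txt; then
	: # download command not preserved in this copy of the message
fi &&
if ! test -f DerivedGeneralCategory.txt; then
	: # download command not preserved in this copy of the message
fi &&
if ! test -d uniset; then
	git clone https://github.com/tboegi/uniset.git
fi &&
(
	cd uniset &&
	if ! test -x uniset; then
		autoreconf -i &&
		./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb' &&
		make # assumed build step; not preserved in this copy
	fi
) &&
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn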
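
For the East Asian Width side mentioned above ("W" and "F"), the ranges can be pulled out of EastAsianWidth.txt in a similar spirit. This is a rough sketch, assuming the usual layout of that file (a code point or FIRST..LAST range, a semicolon, the width property, then a "#" comment). wide_ranges is a made-up helper name, and this is not how the table in utf8.c was produced.

```shell
# Print every code point or FIRST..LAST range whose East Asian Width
# property is W (wide) or F (fullwidth).
# Usage: wide_ranges FILE
wide_ranges () {
	awk '
		/^[0-9A-Fa-f]/ {
			sub(/#.*/, "")       # strip the trailing comment
			gsub(/[ \t]/, "")    # strip padding around the fields
			split($0, f, ";")
			if (f[2] == "W" || f[2] == "F") print f[1]
		}' "$1"
}
```

Running `wide_ranges EastAsianWidth.txt` lists the ranges that a double-width check would have to cover, which could then be compared by eye with the hand-written ranges in utf8.c.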
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html