Dear All,

Attached is a small Perl script for extracting the Arabic presentation
forms from the Unicode data files, along with the output it produces from
version 3.0.1 of the data files.

The license is GPL. Tell me if you think that should be changed.

Most interesting is what I found while testing the program output:

1. Although many consider U+0649 ALEF MAKSURA to be right-joining,
it is dual-joining (look at ArabicShaping.txt). All the programs I know
(incl. Microsoft ones) treat it as right-joining. Another interesting
property is that its presentation forms are not adjacent: FEEF and FEF0
for isolated and final, but FBE8 and FBE9 for initial and medial.

2. The two characters U+0677 and U+06BA have some, but not all, of their
presentation forms in Unicode.
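Both observations can be checked mechanically against the script's output.
The following is only a rough sketch, not part of the attached script: the
helper name check_line is made up for illustration, and the sample lines are
copied from the attached output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: inspect one table line ("CODE CLASS FORM...") and
# return a list of remarks: "missing" if any form is "????", and
# "non-adjacent" if the known forms are not consecutive code points.
sub check_line {
    my ($line) = @_;
    my ($code, $class, @forms) = split ' ', $line;
    my @remarks;
    push @remarks, "missing" if grep { $_ eq "????" } @forms;
    my @known = map { hex($_) } grep { $_ ne "????" } @forms;
    for my $i (1 .. $#known) {
        if ($known[$i] != $known[$i - 1] + 1) {
            push @remarks, "non-adjacent";
            last;
        }
    }
    return @remarks;
}

# A few lines copied from the output attached to this message.
my @sample = (
    "0628 D FE8F FE90 FE91 FE92",
    "0649 D FEEF FEF0 FBE8 FBE9",
    "0677 R FBDD ????",
    "06BA D FB9E FB9F ???? ????",
);

for my $line (@sample) {
    my @remarks = check_line($line);
    print "$line  <-- @remarks\n" if @remarks;
}
```

Run over the whole table, the adjacency check would also report the 0640
line, because the script fills that entry by hand with four identical values.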

--roozbeh

#!/usr/bin/perl
#
# This script extracts the Arabic presentation shapes from the data
# files available from http://www.unicode.org/Public/UNIDATA/
#
# Copyright (C) 2000 Roozbeh Pournader
# 
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#    
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# The GNU General Public License is available from
# http://www.gnu.org/copyleft/gpl.html
#
# Send bugs and suggestions to Roozbeh Pournader <[EMAIL PROTECTED]>
#

open (UNIDATA, "UnicodeData.txt")
        || die "can't open Unicode data file: $!";
open (ARABSHAP, "ArabicShaping.txt")
        || die "can't open shaping data file: $!";

# Column index of each positional form in the output lines.
@number{"isolated","final","initial","medial"} = (0,1,2,3);

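# Collect the presentation forms: each matching UnicodeData.txt line gives
# the form's code point ($1), its position ($2), and the base letter ($3).
while (<UNIDATA>) {
        if (/LETTER.*<(isolated|final|initial|medial)>/) {
                /([0-9A-F]*);.*<(isolated|final|initial|medial)> ([0-9A-F]*)/;
                $shape{$3}[$number{$2}] = $1;
        }
}

# TATWEEL has no presentation forms; it stands for itself in every position.
$shape{"0640"} = ["0640", "0640", "0640", "0640"];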

# $class{"200D"} = "D";
# $class{"200C"} = "U";

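# Read the joining class of each letter from the field after its name.
# Join-causing (C) characters are treated as dual-joining (D).
while (<ARABSHAP>) {
        if (/^[0-9A-F]/) {
                /([0-9A-F]*);[^;]*; (.);/;
                $code = $1;
                $cl = $2;
                $cl =~ s/C/D/;
                $class{$code} = $cl;
        }
}

# Number of forms printed for each joining class.
@shapecount{"U", "R", "D"} = (1, 2, 4);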

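# Letters absent from ArabicShaping.txt are treated as non-joining (U).
# Forms a joining letter should have, but which Unicode does not encode,
# are marked with "????".
foreach $key (keys (%shape)) {
        if (!defined($class{$key})) {
                $class{$key} = "U";
        }
        elsif ($class{$key} =~ /(D|R)/ ) {
                $count = $shapecount{$1};
                for ($i = 0; $i < $count; ++$i) {
                        if (!defined($shape{$key}[$i])) {
                                $shape{$key}[$i] = "????";
                        }
                }
        }
}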

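# Print one line per letter: code point, class, then its forms in the
# order isolated, final, initial, medial.
foreach $key (sort keys(%shape)) {
        print "$key $class{$key}";
        for ($i = 0; $i < $shapecount{$class{$key}}; ++$i) {
                print " $shape{$key}[$i]";
        }
        print "\n";
}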
0621 U FE80
0622 R FE81 FE82
0623 R FE83 FE84
0624 R FE85 FE86
0625 R FE87 FE88
0626 D FE89 FE8A FE8B FE8C
0627 R FE8D FE8E
0628 D FE8F FE90 FE91 FE92
0629 R FE93 FE94
062A D FE95 FE96 FE97 FE98
062B D FE99 FE9A FE9B FE9C
062C D FE9D FE9E FE9F FEA0
062D D FEA1 FEA2 FEA3 FEA4
062E D FEA5 FEA6 FEA7 FEA8
062F R FEA9 FEAA
0630 R FEAB FEAC
0631 R FEAD FEAE
0632 R FEAF FEB0
0633 D FEB1 FEB2 FEB3 FEB4
0634 D FEB5 FEB6 FEB7 FEB8
0635 D FEB9 FEBA FEBB FEBC
0636 D FEBD FEBE FEBF FEC0
0637 D FEC1 FEC2 FEC3 FEC4
0638 D FEC5 FEC6 FEC7 FEC8
0639 D FEC9 FECA FECB FECC
063A D FECD FECE FECF FED0
0640 D 0640 0640 0640 0640
0641 D FED1 FED2 FED3 FED4
0642 D FED5 FED6 FED7 FED8
0643 D FED9 FEDA FEDB FEDC
0644 D FEDD FEDE FEDF FEE0
0645 D FEE1 FEE2 FEE3 FEE4
0646 D FEE5 FEE6 FEE7 FEE8
0647 D FEE9 FEEA FEEB FEEC
0648 R FEED FEEE
0649 D FEEF FEF0 FBE8 FBE9
064A D FEF1 FEF2 FEF3 FEF4
0671 R FB50 FB51
0677 R FBDD ????
0679 D FB66 FB67 FB68 FB69
067A D FB5E FB5F FB60 FB61
067B D FB52 FB53 FB54 FB55
067E D FB56 FB57 FB58 FB59
067F D FB62 FB63 FB64 FB65
0680 D FB5A FB5B FB5C FB5D
0683 D FB76 FB77 FB78 FB79
0684 D FB72 FB73 FB74 FB75
0686 D FB7A FB7B FB7C FB7D
0687 D FB7E FB7F FB80 FB81
0688 R FB88 FB89
068C R FB84 FB85
068D R FB82 FB83
068E R FB86 FB87
0691 R FB8C FB8D
0698 R FB8A FB8B
06A4 D FB6A FB6B FB6C FB6D
06A6 D FB6E FB6F FB70 FB71
06A9 D FB8E FB8F FB90 FB91
06AD D FBD3 FBD4 FBD5 FBD6
06AF D FB92 FB93 FB94 FB95
06B1 D FB9A FB9B FB9C FB9D
06B3 D FB96 FB97 FB98 FB99
06BA D FB9E FB9F ???? ????
06BB D FBA0 FBA1 FBA2 FBA3
06BE D FBAA FBAB FBAC FBAD
06C0 R FBA4 FBA5
06C1 D FBA6 FBA7 FBA8 FBA9
06C5 R FBE0 FBE1
06C6 R FBD9 FBDA
06C7 R FBD7 FBD8
06C8 R FBDB FBDC
06C9 R FBE2 FBE3
06CB R FBDE FBDF
06CC D FBFC FBFD FBFE FBFF
06D0 D FBE4 FBE5 FBE6 FBE7
06D2 R FBAE FBAF
06D3 R FBB0 FBB1
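
Each output line is the base letter, its joining class, and then its forms in
the order isolated, final, initial, medial (the same order as the script's
%number hash). As a rough illustration of how such a table might be consumed,
here is a minimal sketch. The helper form_for and its arguments are
hypothetical, and it is deliberately simplified: it ignores transparent
characters and ligatures, and it expects the caller to say whether the
neighbouring letters join.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load a few table lines (copied from the output above) into a hash.
# Column order after the class letter: 0 = isolated, 1 = final,
# 2 = initial, 3 = medial.
my %table;
for my $line ("0627 R FE8D FE8E",
              "0628 D FE8F FE90 FE91 FE92",
              "0649 D FEEF FEF0 FBE8 FBE9") {
    my ($code, $class, @forms) = split ' ', $line;
    $table{$code} = { class => $class, forms => \@forms };
}

# Hypothetical helper: pick the presentation form of $code, given whether
# the previous and next letters join to it. Returns undef for letters not
# in the table and for forms the table marks as "????".
sub form_for {
    my ($code, $joins_prev, $joins_next) = @_;
    my $entry = $table{$code} or return undef;
    # Only dual-joining (D) letters can connect to the following letter.
    $joins_next = 0 if $entry->{class} ne "D";
    # Non-joining (U) letters connect to nothing.
    $joins_prev = 0 if $entry->{class} eq "U";
    my $index = ($joins_prev ? 1 : 0) + ($joins_next ? 2 : 0);
    my $form = $entry->{forms}[$index];
    return (defined $form && $form ne "????") ? $form : undef;
}

print "initial BEH:         ", form_for("0628", 0, 1), "\n";   # FE91
print "medial ALEF MAKSURA: ", form_for("0649", 1, 1), "\n";   # FBE9
print "ALEF between joiners: ", form_for("0627", 1, 1), "\n";  # FE8E (final)
```

The index arithmetic works only because of the column order: joining to the
previous letter adds 1 and joining to the next letter adds 2.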
