Dear All,

Attached is a small Perl script for extracting the Arabic presentation forms from the Unicode data files, together with the output it produces from the 3.0.1 version of the data files. The license is GPL. Tell me if you think that should be changed.

Most interesting is what I found while testing the program output:

1. Although many consider U+0649 ALEF MAKSURA to be right-joining, it is dual-joining (look at ArabicShaping.txt). All the programs I know (incl. Microsoft ones) treat it as right-joining. Another interesting property is that its presentation forms are not adjacent: the isolated and final forms are at FEEF and FEF0, but the initial and medial forms are at FBE8 and FBE9.

2. The two characters U+0677 and U+06BA have some, but not all, of their presentation forms in Unicode.

--roozbeh

#!/usr/bin/perl
#
# This script extracts the Arabic presentation shapes from the data
# files available from http://www.unicode.org/Public/UNIDATA/
#
# Copyright (C) 2000 Roozbeh Pournader
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# The GNU General Public License is available from
# http://www.gnu.org/copyleft/gpl.html
#
# Send bugs and suggestions to Roozbeh Pournader <[EMAIL PROTECTED]>
#

open (UNIDATA, "UnicodeData.txt") || die "can't open Unicode data file: $!";
open (ARABSHAP, "ArabicShaping.txt") || die "can't open shaping data file: $!";

@number{"isolated","final","initial","medial"} = (0,1,2,3);

while (<UNIDATA>) {
  if (/LETTER.*<(isolated|final|initial|medial)>/) {
    /([0-9A-F]*);.*<(isolated|final|initial|medial)> ([0-9A-F]*)/;
    $shape{$3}[$number{$2}] = $1;
  }
}

$shape{"0640"} = ["0640", "0640", "0640", "0640"];

# $class{"200D"} = "D";
# $class{"200C"} = "U";

while (<ARABSHAP>) {
  if (/^[0-9A-F]/) {
    /([0-9A-F]*);[^;]*; (.);/;
    $code = $1;
    $cl = $2;
    $cl =~ s/C/D/;
    $class{$code} = $cl;
  }
}

@shapecount{"U", "R", "D"} = (1, 2, 4);

foreach $key (keys (%shape)) {
  if (!defined($class{$key})) {
    $class{$key} = "U";
  } elsif ($class{$key} =~ /(D|R)/ ) {
    $count = $shapecount{$1};
    for ($i = 0; $i < $count; ++$i) {
      if (!defined($shape{$key}[$i])) {
        $shape{$key}[$i] = "????";
      }
    }
  }
}

foreach $key (sort keys(%shape)) {
  print "$key $class{$key}";
  for ($i = 0; $i < $shapecount{$class{$key}}; ++$i) {
    print " $shape{$key}[$i]";
  }
  print "\n";
}

0621 U FE80
0622 R FE81 FE82
0623 R FE83 FE84
0624 R FE85 FE86
0625 R FE87 FE88
0626 D FE89 FE8A FE8B FE8C
0627 R FE8D FE8E
0628 D FE8F FE90 FE91 FE92
0629 R FE93 FE94
062A D FE95 FE96 FE97 FE98
062B D FE99 FE9A FE9B FE9C
062C D FE9D FE9E FE9F FEA0
062D D FEA1 FEA2 FEA3 FEA4
062E D FEA5 FEA6 FEA7 FEA8
062F R FEA9 FEAA
0630 R FEAB FEAC
0631 R FEAD FEAE
0632 R FEAF FEB0
0633 D FEB1 FEB2 FEB3 FEB4
0634 D FEB5 FEB6 FEB7 FEB8
0635 D FEB9 FEBA FEBB FEBC
0636 D FEBD FEBE FEBF FEC0
0637 D FEC1 FEC2 FEC3 FEC4
0638 D FEC5 FEC6 FEC7 FEC8
0639 D FEC9 FECA FECB FECC
063A D FECD FECE FECF FED0
0640 D 0640 0640 0640 0640
0641 D FED1 FED2 FED3 FED4
0642 D FED5 FED6 FED7 FED8
0643 D FED9 FEDA FEDB FEDC
0644 D FEDD FEDE FEDF FEE0
0645 D FEE1 FEE2 FEE3 FEE4
0646 D FEE5 FEE6 FEE7 FEE8
0647 D FEE9 FEEA FEEB FEEC
0648 R FEED FEEE
0649 D FEEF FEF0 FBE8 FBE9
064A D FEF1 FEF2 FEF3 FEF4
0671 R FB50 FB51
0677 R FBDD ????
0679 D FB66 FB67 FB68 FB69
067A D FB5E FB5F FB60 FB61
067B D FB52 FB53 FB54 FB55
067E D FB56 FB57 FB58 FB59
067F D FB62 FB63 FB64 FB65
0680 D FB5A FB5B FB5C FB5D
0683 D FB76 FB77 FB78 FB79
0684 D FB72 FB73 FB74 FB75
0686 D FB7A FB7B FB7C FB7D
0687 D FB7E FB7F FB80 FB81
0688 R FB88 FB89
068C R FB84 FB85
068D R FB82 FB83
068E R FB86 FB87
0691 R FB8C FB8D
0698 R FB8A FB8B
06A4 D FB6A FB6B FB6C FB6D
06A6 D FB6E FB6F FB70 FB71
06A9 D FB8E FB8F FB90 FB91
06AD D FBD3 FBD4 FBD5 FBD6
06AF D FB92 FB93 FB94 FB95
06B1 D FB9A FB9B FB9C FB9D
06B3 D FB96 FB97 FB98 FB99
06BA D FB9E FB9F ???? ????
06BB D FBA0 FBA1 FBA2 FBA3
06BE D FBAA FBAB FBAC FBAD
06C0 R FBA4 FBA5
06C1 D FBA6 FBA7 FBA8 FBA9
06C5 R FBE0 FBE1
06C6 R FBD9 FBDA
06C7 R FBD7 FBD8
06C8 R FBDB FBDC
06C9 R FBE2 FBE3
06CB R FBDE FBDF
06CC D FBFC FBFD FBFE FBFF
06D0 D FBE4 FBE5 FBE6 FBE7
06D2 R FBAE FBAF
06D3 R FBB0 FBB1
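
The two observations in the message (non-adjacent forms for 0649, missing forms for 0677 and 06BA) can be checked mechanically against the output above. The following is only a minimal sketch, not part of the attached script: the helper name check_row is hypothetical, and it assumes each row has the shape "CODE CLASS FORM..." exactly as printed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not in the attached script):
# classify one output row of the form "CODE CLASS FORM...".
# Returns "missing" if any form is "????", "non-adjacent" if the
# forms are not consecutive code points, and "ok" otherwise.
sub check_row {
    my ($row) = @_;
    my ($code, $class, @forms) = split ' ', $row;
    return "missing" if grep { $_ eq "????" } @forms;
    for my $i (1 .. $#forms) {
        return "non-adjacent"
            if hex($forms[$i]) != hex($forms[$i - 1]) + 1;
    }
    return "ok";
}

# Three rows copied from the output above.
for my $row ("0649 D FEEF FEF0 FBE8 FBE9",
             "06BA D FB9E FB9F ???? ????",
             "0628 D FE8F FE90 FE91 FE92") {
    my ($code) = split ' ', $row;
    print "$code: ", check_row($row), "\n";
}
```

Note that a check like this would also report 0640 (TATWEEL) as non-adjacent, since the script maps it to itself four times.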
