Abdulhaq Lynch wrote:
I like this idea; however, I don't like (as you've probably guessed by now)
mixing up what is pure text, in the sense that it changes the meaning of the
words, with what indicates pronunciation. Therefore I would modify this such
that IDGHAAM, IKHFAA and IQLAAB (TANWEEN) are indicated by subsequent
codepoints:
TANWEEN/IQLAAB = <vowel><small nuun><iqlaab> (was using small meem)
IDGHAAM = <vowel><small nuun><idghaam> (was using shadda on subsequent letter)
IKHFAA = <vowel><small nuun><ikhfaa> (was sequential blahblah)
and arguably, because it is redundant, I would add
IDHHAAR = <vowel><small nuun><idhhaar>
Likewise I would change the nuun with iqlaab, ikhfaa, etc. from
NUUN + IQLAAB = <nuun><small meem>
to
NUUN + IQLAAB = <nuun><sukuun><iqlaab>
etc.
This has great benefits in terms of searching in that the tajweed codes can be
treated as whitespace and all vowels and sukuuns are easily identified.
Hi,
Interesting design option I hadn't thought of. But I see a couple of
objections:
1. It makes rendering more complex and expensive. With e.g.
<vowel><tanween><iqlaab>
the rendering engine must always check the character following <tanween>
before it can decide what to do.
2. It doesn't reflect the graphic structure of the written text. The
iqlaab mark never accompanies a tanween mark nor a sukuun, does it? IMO
<tanween> should always generate a written mark; it's up to the
rendering engine to figure out which one.
As for searching, it's true that if you can only search for one
character at a time then searching for e.g. all indefinite nouns works
better with your proposal. But I recommend thinking of search
functionality in terms of regular expressions, which allow character
classes. So it's easy to search for any character that is a member of
the class {<tanween>, <iqlaab>, <idghaam>}.
In fact, I'd go further. Rethinking encoding design from the ground up
frees us also to rethink basic text processing conventions. For search,
this means a rethinking of regular expression syntax. Standard regex
syntax supports standard character classes like [[:digit:]] and
[[:alpha:]]. Obviously the standard classes were designed for a
particular script. With Arabic, we need other classes, like
[[:haraka:]], [[:radical:]], etc. In particular, [[:tanween:]] to
denote the set listed above.
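
A rough sketch of how named classes like these could be emulated on top of
an existing engine, here Python's re module. Everything in it is an
assumption made for illustration: the helper names (ARABIC_CLASSES,
expand_classes, find_all) are hypothetical, and the class members are
today's Unicode combining marks rather than codepoints from any redesigned
encoding. It only handles [[:name:]] used on its own, not nested inside a
larger bracket expression.

```python
import re

# Hypothetical class table. The names follow the [[:name:]] idea above;
# the members are current Unicode marks, picked here as an assumption.
ARABIC_CLASSES = {
    "tanween": "\u064B\u064C\u064D",  # fathatan, dammatan, kasratan
    "haraka": "\u064E\u064F\u0650",   # fatha, damma, kasra
    "sukuun": "\u0652",               # sukuun
}

def expand_classes(pattern):
    """Rewrite each standalone [[:name:]] into a plain bracket expression."""
    def repl(match):
        # An unknown class name raises KeyError, which is acceptable in a sketch.
        return "[" + ARABIC_CLASSES[match.group(1)] + "]"
    return re.sub(r"\[\[:(\w+):\]\]", repl, pattern)

def find_all(pattern, text):
    """Return the start offset of every match of the expanded pattern."""
    return [m.start() for m in re.finditer(expand_classes(pattern), text)]

# kaf, ta, alif, ba, dammatan: an indefinite noun ending in tanween
KITAABUN = "\u0643\u062A\u0627\u0628\u064C"
# kaf, fatha, ta, fatha, ba, fatha
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"

tanween_hits = find_all("[[:tanween:]]", KITAABUN)
haraka_hits = find_all("[[:haraka:]]", KATABA)
```

Expanding the names before handing the pattern to the engine keeps the
engine itself untouched, which may be the cheapest way to experiment.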
We can go yet further. Standard regex syntax uses the metacharacter '.'
(period) to denote "any character". Well, that's useful; but in Arabic
we have two fundamental classes of character: base chars and stackers
(for lack of better terminology). So we can define two more
metacharacters, for example ':' = any base char, and '~' = any stacker
(vowel, sukuun, etc.). Then the regex:
k:b
matches any string of three base chars starting with k and ending with
b, e.g. ktb, klb, ksb, etc. If we add a switch like --ignore-stackers,
then k:b would match the same consonants even with vowels, e.g. kataba,
kutub, etc. The regex k~tb would match katb but not ktb.
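
To make the two metacharacters concrete, here is a minimal sketch that
translates such a pattern into an ordinary Python regex. The names
(translate, matches, ignore_stackers) are hypothetical, and the two
character ranges are my own assumption about what counts as a base char
and a stacker in current Unicode. Only literal characters plus ':' and '~'
are handled; no other regex syntax is supported.

```python
import re

# Assumed class membership, using Unicode as it is now:
BASE = "[\u0621-\u063A\u0641-\u064A]"  # Arabic letters (tatweel excluded)
STACKER = "[\u064B-\u0652]"            # tanween, short vowels, shadda, sukuun

def translate(pattern, ignore_stackers=False):
    """Turn ':' and '~' into character classes; everything else is literal."""
    pieces = []
    for ch in pattern:
        if ch == ":":
            piece = BASE
        elif ch == "~":
            piece = STACKER
        else:
            piece = re.escape(ch)
        # With the switch on, any run of stackers may follow a base element.
        if ignore_stackers and ch != "~":
            piece += STACKER + "*"
        pieces.append(piece)
    return "".join(pieces)

def matches(pattern, word, ignore_stackers=False):
    """True if the whole word matches the translated pattern."""
    return re.fullmatch(translate(pattern, ignore_stackers), word) is not None

K, T, L, B = "\u0643", "\u062A", "\u0644", "\u0628"  # kaf, ta, lam, ba
FATHA = "\u064E"

KTB = K + T + B
KLB = K + L + B
KATABA = K + FATHA + T + FATHA + B + FATHA
KATB = K + FATHA + T + B

plain = matches(K + ":" + B, KTB)
vowelled = matches(K + ":" + B, KATABA, ignore_stackers=True)
```

Here ignore_stackers stands in for the --ignore-stackers switch: it simply
allows zero or more stackers after each base element of the pattern.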
There are lots of other interesting possibilities for the design of regex
syntax for Arabic. (Note that we can do this even with Unicode as it is
now.) And there are lots of open source regex engines. I looked at making this
kind of mod a few years ago, but I'm afraid this kind of hacking is a
little over my head at the moment (it's been a good five years since
I've coded). I'll bet there is somebody on this list who could code up
some experimental regex implementations relatively quickly.
-g
_______________________________________________
General mailing list
http://lists.arabeyes.org/mailman/listinfo/general