tanween and regex

Gregg Reynolds Fri, 01 Jul 2005 09:15:12 -0700

Abdulhaq Lynch wrote:

I like this idea; however, I don't like (as you've probably guessed by now) mixing up what is pure text, in the sense that it changes the meaning of the words, and what indicates pronunciation. Therefore I would modify this such that IDGHAAM, IKHFAA and IQLAAB (TANWEEN) are indicated by subsequent codepoints:
TANWEEN/IQLAAB = <vowel><iqlaab> (was using small meem)
IDGHAAM = <vowel><idghaam> (was using shadda on subsequent letter)
IKHFAA = <vowel><ikhfaa> (was sequential blahblah)

and arguably, because it is redundant, I would add

IDHHAAR = <vowel><idhhaar>

Likewise I would change the nuun with iqlaab, ikhfaa etc from

NUUN + IQLAAB was = <nuun>

to

NUUN + IQLAAB = <nuun><sukuun><iqlaab>

etc.
This has great benefits in terms of searching in that the tajweed codes can be treated as whitespace and all vowels and sukuuns are easily identified.

Hi,

Interesting design option I hadn't thought of. But I see a couple of objections:


1.  It makes rendering more complex and expensive.  With e.g.
        <vowel><tanween><iqlaab>

the rendering engine must always check the character following <tanween> before it can decide what to do.

2. It doesn't reflect the graphic structure of the written text. The iqlaab mark never accompanies a tanween mark nor a sukuun, does it? IMO <tanween> should always generate a written mark; it's up to the rendering engine to figure out which one.

As for searching, it's true that if you can only search for one character at a time then searching for e.g. all indefinite nouns works better with your proposal. But I recommend thinking of search functionality in terms of regular expressions, which allow character classes. So it's easy to search for any character that is a member of the class {<tanween>, <iqlaab>, <idghaam>}.
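
A minimal sketch of the character-class search, in Python. The codepoints <tanween>, <iqlaab> and <idghaam> belong to the proposed encoding and do not exist as such, so as an assumption the class below is filled with the three tanween marks of Unicode as it is now. The function name words_with_tanween is made up for the illustration.

```python
import re

# Assumption: stand-ins for the class {<tanween>, <iqlaab>, <idghaam>},
# using the three tanween marks available in current Unicode.
TANWEEN_CLASS = re.compile("[\u064B\u064C\u064D]")  # fathatan, dammatan, kasratan


def words_with_tanween(text):
    """Return the whitespace-separated words containing any mark in the class."""
    return [word for word in text.split() if TANWEEN_CLASS.search(word)]


if __name__ == "__main__":
    # "kitaabun" (with dammatan) followed by unvowelled "ktb"
    sample = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C \u0643\u062A\u0628"
    for word in words_with_tanween(sample):
        print(word)
```

One pattern covers every member of the class, so the search does not need a dedicated marker codepoint after each vowel.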

In fact, I'd go further. Rethinking encoding design from the ground up frees us also to rethink basic text processing conventions. For search, this means a rethinking of regular expression syntax. Standard regex syntax supports standard character classes like [[:digit:]] and [[:alpha:]]. Obviously the standard classes were designed for a particular script. With Arabic, we need other classes, like [[:haraka:]], [[:radical:]], etc. In particular, [[:tanween:]] to denote the set listed above.
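
One way to experiment with such classes without touching a regex engine is to expand the names before compiling. The sketch below does that in Python, whose re module has no POSIX-style named classes. The table NAMED_CLASSES, its codepoint ranges and the function expand_classes are all assumptions made for the illustration; in particular the "radical" range is only a rough approximation.

```python
import re

# Hypothetical class table; the ranges are illustrative assumptions
# mapped onto current Unicode, not part of any standard.
NAMED_CLASSES = {
    "tanween": "\u064B-\u064D",  # fathatan, dammatan, kasratan
    "haraka": "\u064E-\u0650",   # fatha, damma, kasra
    "radical": "\u0621-\u064A",  # Arabic letters, hamza through yeh (rough)
}

_CLASS_REF = re.compile(r"\[:(\w+):\]")


def expand_classes(pattern):
    """Replace each [:name:] inside a bracket expression with its range.

    Raises KeyError for a class name that is not in NAMED_CLASSES.
    """
    return _CLASS_REF.sub(lambda m: NAMED_CLASSES[m.group(1)], pattern)


if __name__ == "__main__":
    # A radical followed by a short vowel
    regex = re.compile(expand_classes("[[:radical:]][[:haraka:]]"))
    print(bool(regex.search("\u0643\u064E\u062A\u064E\u0628\u064E")))
```

Because the expansion happens on the pattern text, classes can be combined in one bracket expression, e.g. [[:haraka:][:tanween:]].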

We can go yet further. Standard regex syntax uses the metacharacter '.' (period) to denote "any character". Well, that's useful; but in Arabic we have two fundamental classes of character: base chars and stackers (for lack of better terminology). So we can define two more metacharacters, for example ':' = any base char, and '~' = any stacker (vowel, sukuun, etc.). Then the regex:

k:b

matches any string of three base chars starting with k and ending with b, e.g. ktb, klb, ksb, etc. If we add a switch like --ignore-stackers, then k:b would match the same consonants even with vowels, e.g. kataba, kitaab, etc. The regex k~tb would match katb but not ktb.
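
The two metacharacters can be tried out today by translating them into an ordinary pattern. The following Python sketch does this under stated assumptions: the ranges chosen for BASE and STACKER are rough stand-ins taken from current Unicode, the function translate and its ignore_stackers parameter (standing in for the --ignore-stackers switch) are hypothetical, and every other character in the pattern is taken literally, so there is no way to escape ':' or '~'.

```python
import re

# Assumed stand-ins for the two classes, using current Unicode ranges.
BASE = "[\u0621-\u064A]"           # Arabic letters, hamza through yeh (rough)
STACKER = "[\u064B-\u0652\u0670]"  # tanween, vowels, shadda, sukuun, superscript alef


def translate(pattern, ignore_stackers=False):
    """Translate ':' (any base char) and '~' (any stacker) for Python's re.

    With ignore_stackers=True, any run of stackers is allowed after each
    base element, so vowelled text matches a consonant-only pattern.
    """
    parts = []
    for ch in pattern:
        if ch == ":":
            parts.append(BASE)
        elif ch == "~":
            parts.append(STACKER)
        else:
            parts.append(re.escape(ch))  # everything else is a literal
        if ignore_stackers and ch != "~":
            parts.append(STACKER + "*")
    return "".join(parts)


if __name__ == "__main__":
    k, t, b = "\u0643", "\u062A", "\u0628"
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(bool(re.search(translate(k + ":" + b), k + t + b)))
    print(bool(re.search(translate(k + ":" + b), kataba)))
    print(bool(re.search(translate(k + ":" + b, ignore_stackers=True), kataba)))
```

Translating to an existing engine keeps the experiment small; a native implementation inside a regex engine would be needed for the new metacharacters to coexist with the full standard syntax.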

There are lots of other interesting possibilities for design of regex syntax for Arabic. (Note that we can do this even with Unicode as it is now.) And there are lots of open source regex engines. I looked at making this kind of mod a few years ago, but I'm afraid this kind of hacking is a little over my head at the moment (it's been a good five years since I've coded). I'll bet there is somebody on this list who could code up some experimental regex implementations relatively quickly.


-g


_______________________________________________
General mailing list
[email protected]
http://lists.arabeyes.org/mailman/listinfo/general
