On 20/03/2008, tedd <[EMAIL PROTECTED]> wrote:
> At 9:29 PM +0200 3/19/08, Dotan Cohen wrote:
>  >I am asking the second question: how many Hebrew characters are in a
>  >string that _very_likely_ contains other characters as well. The array
>  >suggestion sounds about what I am doing: checking if each letter is a
>  >Hebrew character.
>  >
>  >I will also look into the mb_ functions. I did not know about them
>  >before. Thanks.
>  >
>  >Dotan Cohen
>
>
> Dotan:
>
>  It really doesn't make any difference.
>
>  If you have a single character that is not ASCII, then it's something
>  beyond ASCII and you'll need to use the mb_functions.
>
>  Unicode contains all known characters (code points) including ASCII
>  with values equal to ASCII -- so there's no problem between code
>  points and ASCII.
>
>  The beyond-ASCII string problem is basically: what is a character? We
>  all know what an "a" is, but what about "a" with a "~" above it? Is
>  it one character or two? It can be stored as one precomposed code
>  point or as two (the letter plus a combining mark); either way, what
>  the reader sees is a single grapheme.
>
>  What about the character "fi" when it's combined? Is it one character
>  or two? In this case, it's a ligature and is a single code point.
>
>  So, when you are trying to count characters in a string, using ASCII
>  based functions won't work because they might count one character as
>  two and break the character in two parts. Or, the character might be
>  actually two characters, but they should be counted as one. As such,
>  mb_functions are designed to work with these types of problems,
>  whereas standard string functions won't.
>
>  The easy way to tell whether you need the mb_functions: if all the
>  characters you're working with appear in the ASCII table, then
>  standard string functions apply. However, if any of the characters
>  are not found in ASCII, then you need to go another route.
>
>  At least, that's my understanding.
>
>
>  Cheers,
>
>  tedd

Thank you Tedd, that was very helpful. After reading your mail from
yesterday I went to Wikipedia to learn what graphemes and ligatures
are. Your example of "fi" was there, otherwise I would have had no
idea that those letters can be combined. In Hebrew and Arabic,
especially, I can see how the vowel points (Hebrew) and combinations
like "LA" (Arabic) can confuse the ASCII-based functions. Thanks.
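
To make the byte / code point / vowel point distinction concrete, here
is a rough sketch. It is in Python rather than PHP, purely to show the
idea, and the function names are made up for this example. It assumes
that "Hebrew character" means a letter in the range U+05D0 to U+05EA
(alef through tav), so vowel points are counted separately as combining
marks. In PHP I would expect mb_strlen() to give the code point count
and grapheme_strlen() from the intl extension to give the grapheme
count, but I have not tested that here.

```python
import unicodedata

# Assumption for this sketch: a "Hebrew character" is a letter in
# U+05D0..U+05EA (alef..tav). Vowel points live elsewhere in the block.
HEBREW_FIRST = "\u05d0"
HEBREW_LAST = "\u05ea"


def count_hebrew_letters(text: str) -> int:
    """Count code points that fall in the Hebrew letter range."""
    return sum(1 for ch in text if HEBREW_FIRST <= ch <= HEBREW_LAST)


def count_combining_marks(text: str) -> int:
    """Count code points with a non-zero combining class (e.g. vowel points)."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)


# "shalom" without vowel points, followed by ASCII text.
plain = "\u05e9\u05dc\u05d5\u05dd world"

# The same word with a shin dot, a qamats and a holam added.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"

print(len(plain.encode("utf-8")))      # bytes: what a byte-based strlen sees
print(len(plain))                      # code points
print(count_hebrew_letters(plain))     # Hebrew letters only
print(len(pointed))                    # code points, marks included
print(count_hebrew_letters(pointed))   # letters, marks excluded
print(count_combining_marks(pointed))  # vowel points and dots
```

The point of the second string is that counting code points alone
still over-counts what a reader would call "letters" once vowel points
are present, which is why the letter range and the combining marks are
checked separately.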

Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
