On 20/03/2008, tedd <[EMAIL PROTECTED]> wrote:
> At 9:29 PM +0200 3/19/08, Dotan Cohen wrote:
> >I am asking the second question: how many Hebrew characters in a
> >string that _very_likely_ contains other characters as well. The array
> >suggestion sounds about what I am doing: checking if each letter is a
> >Hebrew character.
> >
> >I will also look into the mb_ functions. I did not know about them
> >before. Thanks.
> >
> >Dotan Cohen
>
>
> Dotan:
>
> It really doesn't make any difference.
>
> If you have a single character that is not ASCII, then it's something
> beyond ASCII and you'll need to use the mb_functions.
>
> Unicode contains all known characters (code points) including ASCII
> with values equal to ASCII -- so there's no problem between code
> points and ASCII.
>
> The beyond ASCII string problem is basically what is a character? We
> all know what an "a" is, but what about "a" with a "~" above it? Is
> it one character or two? If it's a combination of two code points,
> then it's a grapheme.
>
> What about the character "fi" when it's combined? Is it one character
> or two? In this case, it's a ligature and is a single code point.
>
> So, when you are trying to count characters in a string, using ASCII
> based functions won't work because they might count one character as
> two and break the character in two parts. Or, the character might be
> actually two characters, but they should be counted as one. As such,
> mb_functions are designed to work with these types of problems where
> as standard string functions won't.
>
> The easy way to tell IF you should use mb_functions is if all the
> characters you're working with appear in the ASCII table, then
> standard string functions apply. However, if any of the characters
> are not found in ASCII, then you need to go another route.
>
> At least, that's my understanding.
>
>
> Cheers,
>
> tedd
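For the original question (how many Hebrew characters a mixed string contains), here is a minimal PHP sketch of the difference between byte-based and multibyte-aware counting. It assumes the string is UTF-8 encoded and that the mbstring extension is loaded. The function name count_hebrew_letters is made up for illustration, and it counts only the letters U+05D0 through U+05EA (alef through tav), so vowel points are deliberately not counted.

```php
<?php
// Hypothetical helper: counts Hebrew letters (U+05D0..U+05EA) in a UTF-8
// string. Vowel points, cantillation marks and punctuation are excluded.
function count_hebrew_letters(string $text): int
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so each match is one code point rather than one byte.
    $count = preg_match_all('/[\x{05D0}-\x{05EA}]/u', $text);

    // preg_match_all() returns false on failure, e.g. invalid UTF-8 input.
    return $count === false ? 0 : $count;
}

$sample = "abc שלום";

echo strlen($sample), "\n";               // 12: bytes (4 ASCII + 4 letters x 2 bytes)
echo mb_strlen($sample, 'UTF-8'), "\n";   // 8: code points
echo count_hebrew_letters($sample), "\n"; // 4: Hebrew letters only
```

If the vowel points should be counted as well, the character class could be swapped for the script property \p{Hebrew}, which should match the points along with the letters. Neither version groups a letter and its points into a single grapheme.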
Thank you Tedd, that was very helpful. After reading your mail from
yesterday I went to Wikipedia to learn what graphemes and ligatures
are. Your example of "fi" was there, otherwise I would have had no
idea that those letters can be combined. In Hebrew and Arabic,
especially, I can see how the vowel points (Hebrew) and combinations
like "LA" (Arabic) can confuse the ASCII-based functions.

Thanks.

Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?