Just forgot to mention: LRM (U+200E) and RLM (U+200F) belong to the General Punctuation block, U+2000..206F (see http://www.unicode.org/charts/PDF/U2000.pdf). Another character in this block which is important for Hebrew is U+201E, the DOUBLE LOW-9 QUOTATION MARK, which should be used as the opening double quotation mark in Hebrew (see my pages below).
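
For anyone who wants to check the byte counts discussed in this thread, here is a rough Python sketch. The helper name utf8_length and the choice of sample characters are mine, for illustration only; the thresholds follow the one-to-six byte table in utf-8(7), and the result is compared against what Python's own encoder produces.

```python
import unicodedata


def utf8_length(code_point: int) -> int:
    """Bytes needed by the original 31-bit UTF-8 scheme (hypothetical helper)."""
    # Exclusive upper bounds for sequences of 1..6 bytes, per the utf-8(7) table.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for n_bytes, limit in enumerate(limits, start=1):
        if code_point < limit:
            return n_bytes
    raise ValueError("code point outside the 31-bit UCS-4 range")


# Sample characters: Hebrew letter alef, LRM, RLM, double low-9 quotation mark.
samples = [0x05D0, 0x200E, 0x200F, 0x201E]

for cp in samples:
    ch = chr(cp)
    encoded = ch.encode("utf-8")
    # The table-based length should agree with the real encoder.
    assert len(encoded) == utf8_length(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"{len(encoded)} bytes ({encoded.hex()})")
```

The Hebrew letter should come out as two bytes, and the three General Punctuation characters as three bytes each, since they all lie at or above U+0800.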
On Tue, 22 Jan 2002, Zvi Har'El wrote:

> On Tue, 22 Jan 2002, Eli Marmor wrote:
>
> > Zvi Har'El wrote:
> > > > umm, isn't UTF-8 8 bit with occasional 16? :)
> > >
> > > UTF-8 is one, two or three bytes per character. In the Hebrew case, a Hebrew
> > > character is two bytes.
> >
> > Of course.
> > But there are some special Hebrew characters (such as RLM/LRM, etc.)
> > that are 3.
>
> To be precise, Hebrew characters are those with Unicode representation
> U+0590..05FF (see http://www.unicode.org/charts/PDF/U0590.pdf), and they all
> occupy two bytes in the UTF-8 encoding. Only characters U+0800 and above need
> three bytes (see utf-8(7) for details). LRM and RLM, the left to right mark
> and right to left mark, are not Hebrew, even according to the most liberal HOK
> HASHEVUT (law of return).
>
> > And theoretically, UTF8 can handle up to 5 bytes.
>
> Six, to be precise, under the full UCS-4, which uses a 31-bit code space.
> However, Unicode 3.0 uses only a 16-bit code space (UCS-2) and thus can be
> encoded into 3 bytes. Again, read utf-8(7) for details. BTW, in Java,
> everything is internally Unicode, and char is two bytes long. In C,
> wchar_t is 4 bytes long.

-- 
Dr. Zvi Har'El     mailto:[EMAIL PROTECTED]     Department of Mathematics
tel:+972-54-227607                   Technion - Israel Institute of Technology
fax:+972-4-8324654 http://www.math.technion.ac.il/~rl/     Haifa 32000, ISRAEL
"If you can't say somethin' nice, don't say nothin' at all." -- Thumper (1942)
              Tuesday, 9 Shevat 5762, 22 January 2002,  1:25PM