>If this is the case, then it is better from a space perspective to use a UCS16 string than a stranded string. The underlying assumption with stranded strings is indeed that code points of like size occur in sequence in the input text.
The algorithm can and should tune this as appropriate for the language. You may find that on plain text it defaults to UCS-2, but for HTML it uses UCS-1 (1 byte) with the occasional 2-byte strand wrapping nearby 16-bit chars. The problem is that files and content are rarely just one language: you normally have some sort of framing or layout, especially in web pages and XML, after which you are lucky to end up with 50% native characters.

Shap, I don't really know the frequency. It is also a changing character set, but I don't know how common it is to create a new char (composed from others) versus using multiple chars for a word; e.g. imported slang may create a new word. I do know that internet/network were new chars. I do know that Japan, Hong Kong, Taiwan and Korea are the biggest pain: they avoid Unicode and mainly use Big5 and other ASCII-compatible encodings.

Ben

From: [email protected] [mailto:[email protected]] On Behalf Of Jonathan S. Shapiro
Sent: Saturday, October 16, 2010 2:46 PM
To: Discussions about the BitC language
Subject: Re: [bitc-dev] Unicode and bitc

2010/10/15 Tomasz Gajewski <[email protected]>

> In Polish (and probably similarly for the languages of other countries in middle and eastern Europe) text is composed mostly of ASCII characters. But we have our special ones: "ąćęłńóśźż", which constitute almost 7% of letters in typical Polish texts and only rarely occur in sequence. So it means that on average every 14th character requires uint16 encoding.

If this is the case, then it is better from a space perspective to use a UCS16 string than a stranded string. The underlying assumption with stranded strings is indeed that code points of like size occur in sequence in the input text.

Ben: Do you have a sense of what the frequency and distribution is of extended code points in typical Chinese text?

Anybody: same question for Japanese text and/or Han?
As I said at one point earlier, we certainly have the option to store UCS8 characters within a UCS16 strand when doing so is more efficient than assembling adjacent strands. I can see some straightforward heuristics that could handle this sensibly, but doing it optimally requires sophistication and probably isn't worthwhile.

shap
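
To make the space trade-off above concrete, here is a rough sketch in Python. It is not taken from the BitC sources. The names (strands, absorb_short_runs, stranded_bytes, flat_ucs2_bytes) and the 8-byte per-strand header are assumptions chosen only for illustration. It splits a string into runs of like-width code points, applies one simple greedy heuristic of the "store UCS8 characters within a UCS16 strand" kind, and compares the result with a flat 2-bytes-per-character string. The heuristic is deliberately not optimal.

```python
# Assumed cost model: every strand pays a fixed header (offset + width tag).
# The figure of 8 bytes is hypothetical, not a BitC design value.
STRAND_OVERHEAD = 8


def width(ch: str) -> int:
    """Bytes per code point: 1 up to U+00FF, 2 for the rest of the BMP."""
    cp = ord(ch)
    if cp <= 0xFF:
        return 1
    if cp <= 0xFFFF:
        return 2
    raise ValueError("code points outside the BMP are not modelled here")


def strands(text: str) -> list[tuple[int, int]]:
    """Split text into maximal runs of equal width, as (width, length) pairs."""
    runs: list[tuple[int, int]] = []
    for ch in text:
        w = width(ch)
        if runs and runs[-1][0] == w:
            runs[-1] = (w, runs[-1][1] + 1)
        else:
            runs.append((w, 1))
    return runs


def absorb_short_runs(runs, overhead=STRAND_OVERHEAD):
    """Greedy heuristic: fold a run into the 2-byte strand on its left when
    it is already 2 bytes wide, or when it is a 1-byte run short enough that
    widening it (n extra bytes) costs no more than a new strand header."""
    out: list[tuple[int, int]] = []
    for w, n in runs:
        if out and out[-1][0] == 2 and (w == 2 or n <= overhead):
            out[-1] = (2, out[-1][1] + n)
        else:
            out.append((w, n))
    return out


def stranded_bytes(runs, overhead=STRAND_OVERHEAD) -> int:
    """Total size of a stranded string under the assumed cost model."""
    return sum(overhead + w * n for w, n in runs)


def flat_ucs2_bytes(text: str) -> int:
    """Size of the same text stored as one flat 2-byte-per-character string."""
    return 2 * len(text)


if __name__ == "__main__":
    # Synthetic "Polish-like" text: one 2-byte letter after every 13 1-byte ones.
    polish_like = ("a" * 13 + "ć") * 10
    print(stranded_bytes(absorb_short_runs(strands(polish_like))))
    print(flat_ucs2_bytes(polish_like))
```

Under these assumed numbers, the synthetic text with one wide letter in every 14 comes out at 310 bytes stranded against 280 bytes flat, which is the situation described for Polish. A text with long uniform runs, such as 100 narrow characters, one wide one, then 100 narrow ones, comes out at 226 bytes stranded against 402 flat. The crossover depends entirely on the header size chosen.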
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
