> If this is the case, then it is better from a space perspective to use a
> UCS16 string than a stranded string. The underlying assumption with stranded
> strings is indeed that code points of like size occur in sequence in the
> input text.

 

The algorithm can and should tune this as appropriate for the language.
You may find that on plain text it defaults to UCS-2, but for HTML it uses
UCS-1 (1 byte) plus the occasional 2-byte sequence that wraps nearby 16-bit
chars.

 

The problem is that files and content are rarely just a language. You
normally have some sort of framing/layout, especially in web pages and XML,
after which you are lucky to end up with 50% native characters.

 

Shap, I don't really know the frequency. It is also a changing character
set, but I don't know how common it is to create a new char (composed from
others) vs. using multiple chars for a word, e.g. imported slang may create
a new word. I do know internet/network were new chars.

 

I do know Japan, Hong Kong, Taiwan and Korea are the biggest pain: they
avoid Unicode and mainly use Big5 and other ASCII-compatible encodings.

 

Ben 

 

From: [email protected] [mailto:[email protected]] On
Behalf Of Jonathan S. Shapiro
Sent: Saturday, October 16, 2010 2:46 PM
To: Discussions about the BitC language
Subject: Re: [bitc-dev] Unicode and bitc

 

2010/10/15 Tomasz Gajewski <[email protected]>

In Polish (and probably similarly for the languages of other countries in
middle and eastern Europe) text is composed mostly of ASCII
characters. But we have our special ones: "ąćęłńóśźż" which constitute
almost 7% of letters in typical Polish texts and only rarely occur in
sequence. So it means that on average every 14th character requires
uint16 encoding.


If this is the case, then it is better from a space perspective to use a
UCS16 string than a stranded string. The underlying assumption with stranded
strings is indeed that code points of like size occur in sequence in the
input text.
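
To make the space argument concrete, here is a rough Python sketch that
compares the two layouts. It rests on assumptions the thread does not
specify: a fixed per-strand overhead of 8 bytes (STRAND_OVERHEAD, a made-up
figure), a 1-byte strand holding code points below U+0100, and the
hypothetical helper names width, strands, stranded_size and flat16_size.

```python
from itertools import groupby

# Hypothetical per-strand bookkeeping cost in bytes (header, length, link).
STRAND_OVERHEAD = 8


def width(ch):
    """Bytes per code point, assuming strands of 1-, 2- or 4-byte elements."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4


def strands(text):
    """Split text into maximal runs of like-sized code points: (width, count)."""
    return [(w, sum(1 for _ in run)) for w, run in groupby(text, key=width)]


def stranded_size(text, overhead=STRAND_OVERHEAD):
    """Total bytes if every run is stored as its own strand."""
    return sum(overhead + w * n for w, n in strands(text))


def flat16_size(text):
    """Total bytes for one flat string of 16-bit units (UTF-16, no BOM)."""
    return len(text.encode("utf-16-le"))


if __name__ == "__main__":
    # "zażółć gęślą jaźń", written with escapes to avoid encoding surprises.
    polish = "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
    for sample in (polish, "hello world"):
        print(len(strands(sample)), stranded_size(sample), flat16_size(sample))
```

Under these assumptions the Polish sample splits into 10 strands and costs
105 bytes stranded against 34 bytes flat, while the pure-ASCII sample is a
single strand costing 19 bytes against 22. The real numbers depend entirely
on the actual strand overhead.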

Ben: Do you have a sense of what the frequency and distribution is of
extended code points in typical Chinese text?

Anybody: same question for Japanese text and/or Han?

As I said at one point earlier, we certainly have the option to store UCS8
characters within a UCS16 strand when doing so is more efficient than
assembling adjacent strands. I can see some straightforward heuristics that
could handle this sensibly, but doing it optimally requires sophistication
and probably isn't worthwhile.
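
One such straightforward heuristic might look like the Python sketch below.
It is only an illustration, not a proposed BitC implementation: it walks
the runs left to right and folds a run into the previous strand whenever
storing both at the wider element size costs no more than keeping two
strands. The 8-byte STRAND_OVERHEAD and the names width, initial_strands,
cost and merge_greedy are assumptions made for the example. Being greedy,
it is not guaranteed to find the optimal partition.

```python
from itertools import groupby

# Hypothetical per-strand bookkeeping cost in bytes.
STRAND_OVERHEAD = 8


def width(ch):
    """Bytes per code point, assuming strands of 1-, 2- or 4-byte elements."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4


def initial_strands(text):
    """Maximal runs of like-sized code points as (width, count) pairs."""
    return [(w, sum(1 for _ in run)) for w, run in groupby(text, key=width)]


def cost(strand_list, overhead=STRAND_OVERHEAD):
    """Total bytes for a list of (width, count) strands."""
    return sum(overhead + w * n for w, n in strand_list)


def merge_greedy(strand_list, overhead=STRAND_OVERHEAD):
    """Fold each run into the previous strand when that is no more expensive."""
    merged = []
    for w, n in strand_list:
        if merged:
            pw, pn = merged[-1]
            wide = max(pw, w)  # a merged strand must use the wider element
            separate = (overhead + pw * pn) + (overhead + w * n)
            together = overhead + wide * (pn + n)
            if together <= separate:
                merged[-1] = (wide, pn + n)
                continue
        merged.append((w, n))
    return merged


if __name__ == "__main__":
    polish = "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
    before = initial_strands(polish)
    after = merge_greedy(before)
    print(cost(before), cost(after), after)
```

With the assumed overhead, the Polish sample collapses from 10 strands
(105 bytes) into a single 2-byte strand of 17 code points (42 bytes),
whereas a long ASCII run next to a single wide character stays split.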


shap


_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
