On 14 October 2010 09:03, William Leslie <[email protected]> wrote:
> On 14 October 2010 16:28, Ben Kloosterman <[email protected]> wrote:
>>
>> >If I really want an indexed UTF-8 string, I say that, and if I
>> >want a rope, I also have to tell you that.
>>
>> I do favour pushing this to the user when it's necessary, but shouldn't the
>> default be reasonably fast and memory efficient? Especially when most of
>> those strings are just storage and not being used/worked on.
>
> I'm not really sure how true that is. I can only speak from my own
> experiences, which are admittedly pretty limited. Most string data I
> see are:
>
> 0. Attributes of domain objects for display, or the results of
> applying simple functions to them. People's names, for example. I
> wouldn't have snarfed these out of the database if I wasn't using
> them.
Yes, you would use the name, but would you use it as a whole, or would you
request just, say, the fourth character from the name? Sure, you will likely
want comparisons on them, but those read until they find a difference, so
they are sequential anyway.

>
>
> >On 14 October 2010 13:25, Ben Kloosterman <[email protected]> wrote:
> >> As 32 bit chars is clearly unacceptable
> >
> >For some things. I don't think UCS-4 is a particularly bad choice for
> >an in-memory encoding, it's the one I'd choose, and I'm not even from
> >California.
>
> :-)
> I think the memory overhead is too bad in business application and DB
> land... but it is a good candidate for the worked-on fixed char arrays. Note
> I'm not viewing it from the point of view of Unix apps reading the data and
> then finishing, but more from memory-hungry business apps, app servers, web
> servers, SOA servers, DBs, etc., where the strings stay in memory for a long
> time while not used.

I think this is in most cases caused by poor design, which in turn is caused
by the relative expense of buying more hardware versus improving the design.

On 14 October 2010 21:31, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Oct 13, 2010 at 3:57 PM, Ben Kloosterman <[email protected]> wrote:
>>
>> It is correct ( I live in China at the moment)
>
> First, thank you for correcting me. The outcomes of Unicode are complex and
> sometimes counter-intuitive. It is very useful to get actual data from those
> who use it. For me, this is especially true in the Asian language context. I
> speak and write enough languages to have a handle on the right-left issue
> and the European language families in a broad sense, but I have no
> experience with ideographic languages or with Han, Kanji, the various Kanas,
> or the Chinese syllabary. And much of African language is totally beyond my
> experience.
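A side note on the comparison point above: bytewise comparison of UTF-8
strings yields the same ordering as comparing code points, so comparisons
can stay purely sequential and never need to decode or randomly index. A
small Python sketch (the function name is mine, purely illustrative):

```python
def utf8_compare(a: str, b: str) -> int:
    """Compare two strings via their UTF-8 bytes; the comparison reads
    sequentially until the first differing byte, with no decoding and
    no random access. Returns -1, 0, or 1."""
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ab > bb) - (ab < bb)

# UTF-8 byte order agrees with code-point order, so sorting by the raw
# bytes matches Python's native (code-point) string ordering:
words = ["abc", "ab\u00e9", "\u4e2d\u6587", "z"]
assert sorted(words, key=lambda w: w.encode("utf-8")) == sorted(words)
```

This ordering property is a designed-in feature of UTF-8, which is why the
variable-width encoding does not hurt plain comparisons.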
AFAIK languages can use one or multiple writing systems that fall into these
categories:

1) simple, alphabet - every letter represents a single consonant or vowel,
with a few simple exceptions (like small groups of letters having special
meaning) - Greek, German, Latin, Russian, ...

2) syllabic, syllabary - a letter represents a syllable, possibly with
optional vowels, combining marks for vowels, etc. - Japanese Kana, Korean,
Arabic, Hebrew, Indic scripts, ...

3) ligatures and special forms - capital vs lowercase in Latin; ti, fi, etc.
ligatures in Latin; Hiragana vs Katakana in Japanese; initial/medial/final
forms in Arabic; multiple-character ligatures in Indic scripts; jamos vs
compound characters in Korean; ...

4) ideograms - a letter or glyph representing a full word, concept, etc. -
special and math symbols (mostly universal), Han characters (used in Chinese,
Japanese, Korean), hieroglyphs of the older Ancient Egyptian scripts (they
were later used as simple letters)

5) idiosyncratic - the written form derives from tradition and is only
tangentially related to the actual spoken language - English, possibly
French, ...

Note that many languages can use multiple representations of the same word
(with/without vowels, Kanji vs Kana, ligature vs multiple letters) and that
Unicode often provides multiple representations of the same visual 'letter' -
jamos vs compound syllables, combining marks vs compound characters...

Unicode also has its own idiosyncrasies. In CJK, some characters that look
similar in Traditional Chinese/Simplified Chinese/Japanese were initially
assigned one codepoint, and later, when this caused problems, combining marks
were added for selecting the right form so that the character fits into
texts/names in a particular language. The native encodings for CJK include
characters not in Unicode, which may or may not be added to Unicode
eventually.
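The point about multiple Unicode representations of the same visual 'letter'
can be seen directly with Python's unicodedata module (a small illustration
of mine, not something proposed in the thread):

```python
import unicodedata

# 'e-acute' as one precomposed code point vs 'e' + combining acute accent:
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
assert precomposed != decomposed  # different code point sequences...
# ...but Unicode normalization maps between the two forms:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Korean: a precomposed Hangul syllable vs its conjoining jamos:
assert unicodedata.normalize("NFD", "\uac00") == "\u1100\u1161"
```

This is why text comparison at the code-point level is not the same as
comparison at the "what the reader sees" level unless both sides are first
normalized to the same form.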
There are also encodings which include character variants to accurately
represent ancient Chinese texts, for example, which are likely not going to
be folded into Unicode. So you should be prepared for situations when text in
an external encoding cannot be completely converted into the internal
encoding.

On 14 October 2010 21:51, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Oct 13, 2010 at 5:59 PM, Ben Kloosterman <[email protected]> wrote:
>
>> True but there is a lot of code like index of start, index of end, take
>> substring. This could be horrible, say, for </body> on a typical 2-3K html
>> page, and even searching from the end would be bad.
>
> Not so. I think the misunderstanding lies here:
>
>> I assume the O(log n) is referring to the fact that in many cases the
>> search is not from the start. ..
>
> Not so. It's the time for an arbitrary indexing operation on an arbitrary
> string proceeding de novo. The next/previous character operations are O(1) in
> all implementations **other than** UTF-8/UTF-16 (I disregard UTF-32, because
> in that case we would instead use ucs4[]). The UTF encodings do not
> synchronize backwards in any sensible fashion.

They are something like O(7) in UTF-8 and O(3) in UTF-16.

> The implementation I have in mind is as follows. Some of the low-level
> details are being made up as I go along.
>
> 1. Code points that can be encoded in a single UTF-8 byte are represented
> as bytes. Code points that cannot be encoded in utf-8.1 but can be encoded
> in a single utf-16 unit are encoded as uint16. All others are encoded as
> uint32. This choice, by the way, is not innocent; it gives us leave to
> decide when a run is too short to justify switching encodings.

This is probably one of the most space efficient representations you can get
if you tune the "too short" length properly.
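Point 1 of the proposed representation (store each code point in the
smallest of byte/uint16/uint32) can be sketched as follows. `unit_width` is
a hypothetical name of mine, and excluding the surrogate range from the
2-byte case is my reading of "a single utf-16 unit":

```python
def unit_width(cp: int) -> int:
    """Smallest storage unit for a code point under the proposed scheme:
    1 byte if it fits in a single UTF-8 byte (ASCII), 2 bytes if it fits
    in a single UTF-16 code unit (BMP, excluding surrogates), else 4."""
    if cp < 0x80:
        return 1  # a single UTF-8 byte
    if cp <= 0xFFFF and not (0xD800 <= cp <= 0xDFFF):
        return 2  # a single UTF-16 code unit
    return 4      # needs uint32

assert unit_width(ord("a")) == 1       # ASCII
assert unit_width(ord("\u4e2d")) == 2  # CJK character in the BMP
assert unit_width(0x1F600) == 4        # outside the BMP
```

A run of code points would then be stored at the widest unit any of its
members needs, with the "too short" tuning deciding when a change of width
justifies starting a new run.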
UTF-8 is most efficient for ASCII, possibly with a few special characters
added; UTF-16 for CJK text and other texts that don't use Latin but fit into
the 16-bit codepoint space; UTF-32 for obscure languages (so long as you want
to stick with Unicode).

On 14 October 2010 21:53, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Oct 13, 2010 at 4:31 PM, William Leslie
> <[email protected]> wrote:
>>
>> I mean to say that the in-memory format should favour efficiency of
>> iteration and slicing rather than space efficiency. Space efficient
>> representations can be reserved for serialisation. UTF-8 is a
>> fantastic wire format, and it's great on disk, but the space-saving
>> advantages are less important once you are in-memory.
>
> So you're okay with reducing the D-cache and D-TLB performance on
> large-scale programs, and therefore their overall performance, by a factor
> of >4? That seems a bit over-purist to me.
>
> So first, I think this is the wrong way to prioritize as a matter of
> defaults, but second, I think I've already made it clear that no either/or
> choice is actually required. The "stranded string" approach does all of what
> you want and more. The O(log n) factor issue is more than compensated for by
> the improvement in D-cache and D-TLB utilization.

By calling for a more complex implementation you are reducing I-cache and
I-TLB efficiency (the I-cache and I-TLB may or may not be separate from the
data ones). I can't say which is more important in which situation without
tedious analysis of running actual programs on actual hardware, though.

On 15 October 2010 09:44, Ben Kloosterman <[email protected]> wrote:
>>2010/10/15 Ben Kloosterman <[email protected]>:
>
>> The main cons I see is, besides the tree index/reference cost, each
>> substring would need a field (which may be aligned to 4-8 bytes) or char to
>> indicate the encoding, and the higher initial/final parse overhead.
>
> >I think shap imagines that there are different types for leaf nodes
> >with different encodings, so the encoding is determined by the type/gc
> >tag. So a string with one encoding type would appear in memory as
>
> This is really saying we have strands (which are strings) of a certain
> encoding within a big string... so it is mainly an abstraction wrapper.
>
> I think small string and big string separation may be better.
>
> 1. A tree with separate types will incur quite a large cost, e.g. even 2
> empty references is 16 bytes, which is a bit much for an empty string,
> especially considering string arrays initialized to empty strings. Doing a
> cout << English chars << Chinese char << English char etc. would be
> problematic; while it appears simple, in practice you would need to parse
> and convert them all to, say, UCS-4 after having them as UCS-1 and UCS-4,
> else the tree becomes too big.

Since you are doing *external output* you need to convert *all
strings/strands/...* to the *external encoding* regardless of how they are
stored in your application.

> 2. The problem we are trying to solve (GC, O(N)) applies only to large
> strings, so why pay the price for frequently used small strings? A horses
> for courses approach may fit better, and the big string can solve a number
> of other problems.

Since *any* reference takes 8 bytes (the size of a pointer) I don't see an
empty string taking 8 bytes as an issue.

Thanks

Michal

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
