Re: [bitc-dev] Unicode and bitc

Jonathan S. Shapiro Wed, 13 Oct 2010 15:14:03 -0700

On Tue, Oct 12, 2010 at 7:00 PM, Ben Kloosterman <[email protected]> wrote:


> Looking further, UCS-2 is now regarded as obsolete as a document
> representation and UTF-16 is not the same as it has variable sized
> extensions....
>
> (Note all UCS-2 is readable by UTF-16 but not the reverse)...
>

UCS-2 was never a document format. UCS-2 describes a format for *code units*.
UTF-16 describes one of many strategies for serializing code units into
well-formed "strings".
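
To make the UCS-2 / UTF-16 difference concrete, here is a minimal sketch that
is not part of the original thread. It assumes Python 3 and its built-in
"utf-16-be" codec; the helper name utf16_units is made up for illustration.
It counts how many 16-bit code units UTF-16 needs for a single code point.

```python
# Illustration only: utf16_units is a hypothetical helper, not a library call.
def utf16_units(ch: str) -> int:
    """Return the number of 16-bit code units UTF-16 uses for one code point."""
    return len(ch.encode("utf-16-be")) // 2

bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"  # U+1D11E, outside the Basic Multilingual Plane

print(utf16_units(bmp_char))     # one code unit, as in UCS-2
print(utf16_units(astral_char))  # two code units (a surrogate pair)
```

A code point outside the Basic Multilingual Plane has no single-unit form,
which is why UTF-16 text cannot in general be read as UCS-2.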


> yet basic indexing of variable sized format UTF-8 or UTF-16 is misleading
> to developers as you nearly always need to do a O(n) scan from the start
> this means you need different methods to handle it optimally...
>

This simply isn't true. Or rather, it's true exactly if your string I/O
library is gratuitously stupid. Observe that there is an inevitable point
where an O(n) pass must be made over the string in any case, which is when
you [de]serialize it. You have to do the scan at that point anyway. All you
have to do beyond that point is maintain the data structure. Once you have a
rope-like string structure, indexing by any unit you like is O(log n) in the
number of ropes. In practice, the overwhelming majority of strings consist of
code points of a single size even in a rope-based implementation, so "n"
doesn't tend to get very big.
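
The following sketch shows one way the idea above could look. It is an
assumption-laden illustration, not the ICU or Cedar design: the names
SegmentedString and from_text are hypothetical, segments are kept in a flat
list rather than a tree, and each segment stores code points at one fixed
width (1, 2, or 4 bytes). The single O(n) pass happens in from_text; after
that, lookup is a binary search over segment boundaries.

```python
import bisect
from itertools import accumulate

class SegmentedString:
    """Hypothetical rope-like string: fixed-width segments plus cumulative lengths."""

    def __init__(self, segments):
        # Each segment is (width_in_bytes, payload), built during the decode pass.
        self.segments = segments
        # ends[i] = number of code points held in segments[0..i]
        self.ends = list(accumulate(len(data) // width for width, data in segments))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def code_point_at(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Binary search over segment boundaries: O(log n) in the segment count.
        seg = bisect.bisect_right(self.ends, index)
        start = self.ends[seg - 1] if seg else 0
        width, data = self.segments[seg]
        offset = (index - start) * width
        return int.from_bytes(data[offset:offset + width], "big")

def from_text(text):
    """One O(n) pass that groups adjacent code points of the same width."""
    segments = []
    for ch in text:
        cp = ord(ch)
        width = 1 if cp < 0x100 else 2 if cp < 0x10000 else 4
        if segments and segments[-1][0] == width:
            segments[-1][1].extend(cp.to_bytes(width, "big"))
        else:
            segments.append((width, bytearray(cp.to_bytes(width, "big"))))
    return SegmentedString(segments)

s = from_text("na\u00efve \U0001D11E clef")
print(len(s.segments), hex(s.code_point_at(6)))
```

A string whose code points all share one width collapses to a single segment,
so the search degenerates to direct array indexing.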

Aside: once again you have a layering confusion here (as did I, in this
case). UTF-n refers to how the string is externally represented. A UTF-n
representation can be used as the internal representation if you like, but
strings are (logically) indexed at code points. The problem is that a great
deal of software and one or two key libraries (notably XPath) got their
index handling specifications wrong, so people end up needing to index by
code units as well.
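
As an illustration of the two indexing schemes (a sketch outside the original
thread, assuming Python 3, where str is indexed by code point), the
hypothetical helper utf16_offset below translates a code point index into the
corresponding UTF-16 code unit index. It uses a plain linear scan for clarity.

```python
def utf16_offset(text: str, cp_index: int) -> int:
    """Translate a code point index into a UTF-16 code unit index (linear scan)."""
    # Code points above U+FFFF take two 16-bit code units; all others take one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:cp_index])

text = "a\U0001D11Eb"
print(text[2])                # code point index 2 is "b"
print(utf16_offset(text, 2))  # the same character's UTF-16 code unit index
```

The two indices agree only while every preceding code point fits in one code
unit, which is the source of the mismatch described above.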


> While Java does have excellent XML parsers there are plenty of good C ones
> which do UTF-8. Libxml2-SAX blows away Java ones by 30-50%, working in
> UCS-2 means you may not be able to meet your C performance goal...
>

Once again, this is a mildly confused statement. There is a root problem
here: XPath and the XML content model disagree about indexing. One indexes
by code points, the other by UCS-2 code units (I forget which is which). Any
conforming implementation (regardless of language) needs to do both. Java
loses here almost entirely because of (a) copy costs, and (b) indexing
overheads induced by an utterly horrible concurrency model.


> 2. The model I propose is very careful not to take any position that
> commits the implementation to a particular representation. I'd note that
> the IBM ICU components have a very strong string implementation that
> satisfies all of the concerns you raise while retaining perfectly fine
> in-memory space performance
>
>
>
> Java still suffers from excessive memory usage on embedded devices and
> their SAX XML parsers are still inferior to C. Regarding taking a position,
> that is true, but note as I said an indexer on a string implies to a
> developer a fixed-width implementation, which can only be ASCII, UCS-2,
> UCS-4 and UTF-32 without causing developers to write unexpectedly poorly
> performing code for UTF-8 and UTF-16...
>

I'm not clear why we are even talking about Java here. ICU is also
implemented in C++, for example.

As to developers drawing incorrect inferences derived from insufficient
understanding of how a good string implementation is built, that's not my
responsibility. :-) [No shot at you personally is intended - string reps are
subtle things to do well.]

> - Strings are immutable, providing GC benefits as well as multi
> threading esp the diabolical string changed by other thread issue.
>

In a good implementation this is not necessarily the case. Strings are
logically immutable at the programming API layer, but mutable at the
implementation layer. It's not possible to do an efficient job with
substring operations without this - you really want shared substructure. The
Cedar/Mesa "Ropes" implementation was very good in this regard, and extends
gracefully to deal with mixed code point sizes.
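
A minimal sketch of shared substructure follows. It is illustrative only and
is not the Cedar/Mesa Ropes design: the class name StrView and its fields are
hypothetical, and a Python str stands in for the underlying buffer. A
substring is a new view over the same buffer, so no code points are copied.

```python
class StrView:
    """Hypothetical string that looks immutable but shares one buffer across substrings."""
    __slots__ = ("_buf", "_start", "_stop")

    def __init__(self, buf, start=0, stop=None):
        self._buf = buf
        self._start = start
        self._stop = len(buf) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def substring(self, start, stop):
        if not 0 <= start <= stop <= len(self):
            raise IndexError((start, stop))
        # O(1): a new view over the same buffer; nothing is copied.
        return StrView(self._buf, self._start + start, self._start + stop)

    def __str__(self):
        # Materialize only when the caller actually needs the characters.
        return self._buf[self._start:self._stop]

whole = StrView("shared substructure")
part = whole.substring(7, 19).substring(0, 3)
print(str(part))
```

The API exposes no way to modify a view, so callers see an immutable string,
while the implementation remains free to reorganize the shared buffer behind it.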

shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
