Marco van de Voort wrote:
> First you would have to come up with a workable model for s[x] being utf32chars in general that doesn't suffer from O(N^2) performance degradation (read/write)
Right, UTF-32 or UCS-2 would be much more useful for computations.
> And for it to be useful, it must be workable for more than only the most basic loops, but also for e.g. if length(s)>1 then for i:=1 to length(s)-1 do s[i]:=s[i+1]; and similar loops that might read/write more than once, and use calculated expressions as the parameter to []
Okay, essentially you have outlined why UTF is not a useful encoding for computation at all. The above loop body results in O(N^2) behaviour for every kind of loop, be it counted or iterated, on data structures with non-uniformly sized elements.
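To make that concrete, here is a minimal sketch (my own illustration, not existing RTL code) of what a code-point index into a UTF-8 string costs; Utf8SeqLen and CodePointOffset are hypothetical helpers and assume well-formed input:

program IndexCostDemo;
{$mode objfpc}{$H+}

{ Byte length of the UTF-8 sequence introduced by lead byte Lead
  (assumes well-formed UTF-8). }
function Utf8SeqLen(Lead: Byte): SizeInt;
begin
  if Lead < $80 then Result := 1
  else if Lead < $E0 then Result := 2
  else if Lead < $F0 then Result := 3
  else Result := 4;
end;

{ Byte position (1-based) of the Index-th code point in a UTF-8 encoded
  string: it has to walk from the start, so one call already costs O(Index). }
function CodePointOffset(const s: AnsiString; Index: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 1;
  for i := 2 to Index do
    Inc(Result, Utf8SeqLen(Byte(s[Result])));
end;

var
  Sample: AnsiString;
begin
  Sample := 'ab'#$C3#$A4'cd';            { 'a','b', U+00E4 (two bytes), 'c','d' }
  WriteLn(CodePointOffset(Sample, 4));   { prints 5 }
end.

The quoted loop would need something like CodePointOffset for every read and every write, i.e. roughly 2*N scans of up to N characters, hence O(N^2); on top of that, an assignment can change the byte length of the target character and force the tail of the string to be moved.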
UTF encodings have their primary use in data storage and exchange with external APIs. A useful data type and implementation would use a fixed-width (SxCS) encoding internally, and would accept or supply UTF-8 or UTF-16 strings only when explicitly asked to. All serious UTF/MBCS implementations already come with iterators, and only uneducated people would ever try to apply their SBCS-based algorithms and habits to MBCS encodings. At least they would change their mind soon, after encountering the first bugs resulting from such an inappropriate approach.
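As a rough sketch of that split, assuming the standard FPC/Delphi RTL helpers UTF8Decode/UTF8Encode and UnicodeStringToUCS4String/UCS4StringToUnicodeString (and the usual convention that UCS4String carries a trailing zero element):

program Ucs4Demo;
{$mode objfpc}{$H+}

var
  u8: UTF8String;
  s4: UCS4String;
  i:  SizeInt;
begin
  u8 := 'some UTF-8 input';
  { Convert once, at the boundary; the result is a dynamic array of
    UCS4Char, so every element access below is O(1). }
  s4 := UnicodeStringToUCS4String(UTF8Decode(u8));

  { The quoted shift loop, now a plain O(N) pass; Length(s4)-1 is the
    number of code points, the last element being the terminating zero. }
  if Length(s4) > 2 then
    for i := 0 to Length(s4) - 3 do
      s4[i] := s4[i + 1];

  { Convert back only when a UTF-8 string is explicitly asked for. }
  u8 := UTF8Encode(UCS4StringToUnicodeString(s4));
  WriteLn(u8);
end.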
BTW, I just found a similarly inappropriate handling of digraphs in the scanner, where checks for such character combinations occur in many places, with no guarantee that all cases are really covered.
Furthermore, I think that detailed Unicode string handling should not be based on single characters at all, but should instead use (sub)strings throughout, covering multi-byte character representations, ligatures etc. as well. Then the basic operations would be insertion and deletion of substrings, in addition to substring extraction and concatenation.
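Purely as an illustration of that idea, a hedged sketch of such a substring-based surface (TTextPos, ITextBuffer and all method names are hypothetical, not an existing FPC API):

unit TextBufSketch;
{$mode objfpc}{$H+}

interface

type
  { Opaque position inside the text; here simply a byte offset. }
  TTextPos = SizeInt;

  { Substring-centric editing surface: callers never index single
    characters; positions come from searching, not from counting. }
  ITextBuffer = interface
    function Find(const Pattern: UTF8String; From: TTextPos): TTextPos;
    function Extract(FromPos, ToPos: TTextPos): UTF8String;
    procedure InsertAt(Pos: TTextPos; const Fragment: UTF8String);
    procedure DeleteRange(FromPos, ToPos: TTextPos);
    procedure Append(const Fragment: UTF8String);
  end;

implementation

end.

Such an interface can also take care of keeping the buffer on code-point (or grapheme) boundaries internally, so combining sequences and ligatures never get split by callers.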
DoDi