On 01/10/12 13:02, Noel Grandin wrote:
>
> On 2012-10-01 12:38, Michael Meeks wrote:
>> We could do some magic there; of course - space is a bit of an issue -
>> we already pointlessly bloat bazillions of ascii strings into UCS-2
>> (nominally UTF-16) representations and nail a ref-count and length on
>> the beginning. If you turn on the lifecycle diagnostics in
>> sal/rtl/source/strimp.hxx with the #ifdef and re-build sal, you can
>> start to see the scale of the problem when you launch libreoffice ;-)
>
> Changing subject because I'm changing the topic.
>
> That was something I was thinking about the other day - given that the
> bulk of our strings are pure 7-bit ASCII, it might be a worthwhile
> optimisation to store a bit that says "this string is 7-bit ASCII", and
> then store the string as a sequence of bytes.
>
> The latest Java VM does this trick internally - it pretends that String
> is stored with an array of 16-bit values, but actually it stores them as
> UTF-8.

it does that?  impressive that they could dig their way out of the
UTF-16 hole...  but whatever they are doing won't be possible with our
OUStrings, which directly expose the internal sal_Unicode array.

> Even in an app running in a language other than US-English, strings are
> used for so many internal things that >90% of the strings are 7-bit ASCII.

space overhead is one problem with UTF-16 strings, but there are other
problems as well: they are very error-prone to use in an application
like LO that really must be 100% i18n-able.  with UTF-16 it's all too
easy to write loops over the 16-bit code units without taking into
account that some Unicode code points are represented by not one but
two UTF-16 code units (a surrogate pair).  that leads to real i18n bugs
that are very difficult to detect because they only happen with rather
obscure languages; i.e. UTF-16 manages to combine the size overhead of
UCS-4 and the variable length of UTF-8 into the worst of both worlds.

with a UTF-8 string these i18n bugs would be very easy to detect since
they happen in pretty much every non-English language; you don't need
to be able to write Cuneiform to see the problem.  iteration should be
done with a dedicated method that returns the next code point as an
int32_t.

also a UTF-8 string could be really constant: just write an ordinary
string literal in C++ and wrap a value class around it, no memory
allocation needed.

... which brings me to another point: in a hypothetical future when we
could efficiently create a UTF8String from a string literal in C++
without copying the darn thing, what should hypothetical operations to
mutate the string's buffer do?

_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
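
to make the two ideas above a bit more concrete, here is a rough sketch
(not a proposal for the actual sal API).  everything in it is
hypothetical: the class name Utf8Literal, the methods nextCodePoint and
codePointCount, and the helper utf16CodePointCount are made up for
illustration.  it assumes C++11, returns -1 both at the end of the
string and on malformed input, and does not reject overlong encodings
or encoded surrogates, which a real implementation would have to do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical read-only wrapper around a UTF-8 string literal.
// It stores only a pointer and a length, so no memory is allocated
// and the literal is never copied.
class Utf8Literal {
public:
    template <std::size_t N>
    explicit constexpr Utf8Literal(const char (&literal)[N])
        : data_(literal), size_(N - 1) {}  // N includes the trailing '\0'

    constexpr std::size_t byteLength() const { return size_; }

    // Decodes the code point that starts at byte offset pos and advances
    // pos past it. Returns -1 (leaving pos unchanged) at the end of the
    // string or if the bytes at pos are not a valid UTF-8 sequence.
    std::int32_t nextCodePoint(std::size_t& pos) const {
        if (pos >= size_) return -1;
        const unsigned char lead = static_cast<unsigned char>(data_[pos]);
        std::size_t extra = 0;   // number of continuation bytes expected
        std::int32_t cp = 0;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return -1;          // stray continuation byte or invalid lead
        if (size_ - pos - 1 < extra) return -1;  // sequence is truncated
        for (std::size_t i = 1; i <= extra; ++i) {
            const unsigned char c = static_cast<unsigned char>(data_[pos + i]);
            if ((c & 0xC0) != 0x80) return -1;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        pos += extra + 1;
        return cp;
    }

    // Counts code points by decoding; stops at the first invalid sequence.
    std::size_t codePointCount() const {
        std::size_t pos = 0, count = 0;
        while (nextCodePoint(pos) >= 0) ++count;
        return count;
    }

private:
    const char* data_;
    std::size_t size_;
};

// For comparison: counting code points in UTF-16 must treat a high
// surrogate followed by a low surrogate as ONE code point. A loop that
// just counts 16-bit units gets this wrong for non-BMP characters.
inline std::size_t utf16CodePointCount(const char16_t* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;  // skip the low half of the surrogate pair
        ++count;
    }
    return count;
}
```

with this, Utf8Literal s("na\xC3\xAFve") has a byteLength() of 6 but a
codePointCount() of 5, so a loop that confuses bytes with characters
already breaks on ordinary accented text.  the sketch is deliberately
read-only; it says nothing about what mutating operations should do.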