On Mon, Apr 22, 2013 at 11:02 PM, Andi Vajda wrote:
>>> Isn't UTF-32 used in Python (among other encodings)?
>
> Python is moving to a model where a string could be in any UTF width, based
> on its characters:
>
>   http://www.python.org/dev/peps/pep-0393/

Thanks, Andi.

Like strings in Python (and Perl -- but not Java), strings in Clownfish have
a requirement to support multiple encodings for the same logical content.

Reviewing this PEP gave me the opportunity to rethink some assumptions I'd
made when CharBuf was written. My expectation was that we'd ultimately
support encoding variability through subclassing: CharBufUTF8, CharBufUTF16,
and so on -- but Python has everything in one class. That would have seemed
unwieldy for a mutable type, but maybe it's reasonable if our String type is
immutable.

Our motivations for supporting multiple internal encodings differ from those
of Python.

*   In Python, the unfortunate idiom of treating strings as random-access
    character arrays has to be supported, so strings support multiple
    fixed-width representations (ASCII, UCS2, UTF-32) and the smallest width
    is chosen (according to the largest code point in the string) in order
    to minimize memory.
*   In Clownfish, we're driven by the need to interface with multiple host
    languages (though not at the same time, hmm).

I suggested earlier that CharBuf might need only a single constructor, with
an initial capacity argument -- but once we start supporting multiple
encodings, that will have to be specified as well. However, for the sake of
simplicity, robustness and speed, objects which are used to build up strings
should probably support only one encoding.

Nick, it seems to me that your iterators can work well with either a
single-class or a subclassing approach, for both CharBuf and String.
Thoughts?

I'd prefer not to commit one way or the other yet -- we can implement an
immutable String class while maintaining support for only UTF-8 right now,
and take stock later on.
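
To make the width-selection rule from the PEP concrete, here is a rough
Python sketch. It is an illustration only, not CPython's actual code:
`pick_width` and `fixed_width_size` are made-up names, and the thresholds
assume the 1-, 2- and 4-byte storage kinds that PEP 393 describes.

```python
def pick_width(text: str) -> int:
    """Return bytes per code point for the narrowest fixed-width form.

    Hypothetical helper: the thresholds mirror the 1-, 2- and 4-byte
    kinds in PEP 393, chosen by the largest code point in the string.
    """
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        return 1
    if largest <= 0xFFFF:
        return 2
    return 4


def fixed_width_size(text: str) -> int:
    """Payload size in bytes, ignoring any per-object header."""
    return pick_width(text) * len(text)


# One code point outside the BMP widens every character in the string.
sample = "a\U0001F600"
fixed = fixed_width_size(sample)        # 2 code points at 4 bytes each
utf8 = len(sample.encode("utf-8"))      # 1 byte + 4 bytes
```

The trade-off is visible in the last three lines: fixed width buys O(1)
indexing at the cost of widening the whole string, while UTF-8 stays
compact but has to be walked with an iterator.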

There's going to be a lot of superficial churn in Lucy as we change
`CharBuf` to `String` everywhere. The implementation changes later won't
have such large ripple effects.

Marvin Humphrey
