On 14 October 2010 16:28, Ben Kloosterman <[email protected]> wrote:
>
> >If I really want an indexed UTF-8 string, I say that, and if I
> >want a rope, I also have to tell you that.
>
> I do favour pushing this to the user when its necessary but shouldn’t the
> default be reasonably fast and memory efficient ? Especially when most of
> those strings are just storage and not being used /worked on.

I'm not really sure how true that is.  I can only speak from my own
experiences, which are admittedly pretty limited.  Most string data I
see are:

0. Attributes of domain objects for display, or the results of
   applying simple functions to them.  People's names, for example.  I
   wouldn't have snarfed these out of the database if I wasn't using
   them.

1. XML messages (say, in a JMX environment).

2. HTML / XML templates.

3. Symbols.

These things either seem to be processed in a stream-like manner
(templates) or are manipulated and searched frequently.  The only case
where tighter storage space would be useful to me is the symbol case.
I haven't seen recent GC dumps from our translation toolchain, but
IIRC there were a lot of strings in use as symbols.  Within a parser
itself it probably isn't important - you generate objects from the
source and immediately throw the source away - but small strings float
around in various metadata, such as debugging information and method
descriptors.

> Also I don’t see BitC growing past a single runtime for a LONG time... :-)
> I'm not sure it is meant to be the antithesis as I kind of like the smart
> runtimes and the drop the file and it always runs rather than spending ages
> hacking / configuring and trying to compile something .

I mean smart in a semantic way, in that it will try to do a lot of
clever optimisation.  The intention seems to be 'do what I say, not
what I mean', so you can reason about what happens near to the
hardware by looking at the source.  Selecting different string
implementations depending on the construction of the string is
something I'd classify as clever.  I do like it, but I am not
convinced it should be the default for BitC (yet).

> Certainly not on the string itself but in the standard lib you could use a
> 64 bit value and the high bit whether it's a byte index or char index when
> setting the char index immediately convert to a byte.
> It wouldn’t be that
> expensive since it is only indexes on active worked on strings. It would
> just allow most users to use char indexes and when needed the real byte
> indexes. Im not really sold on this idea...it is probably easier to just say
> they are byte indexes ..

So you are saying you (who may be the implementer of the string
module) could convert to a specific representation where needed in
string-heavy functions?  I think we are starting to agree here.

> >On 14 October 2010 13:25, Ben Kloosterman <[email protected]> wrote:
> >> As 32 bit chars is clearly unacceptable
> >
> >For some things. I don't think UCS-4 is a particularly bad choice for
> >an in-memory encoding, it's the one I'd choose, and I'm not even from
> >California.
>
> :-)
> I think the memory overhead is too bad in business application and DB
> land... but it is a good candidate for the worked on fixedchararrays . Note
> I'm not viewing it from the point of view of Unix apps reading the data and
> then finishing but more from memory hungry Business apps, App servers , Web
> servers , SOA servers , DBs etc where the strings stay in memory for a long
> time while not used.

If it is a problem, when it is a problem, and when the app happens to
be written in BitC, it would be nice to change the representation as
needed.  I think the jury is still out on whether UTF-8 is a nice
default, though.  It certainly makes life more difficult for people
who want to implement VMs in BitC: most VMs assume O(1) indexing of
their native string type, and having to always convert in order to use
native string functions would be awkward.

Do consider that most business applications are already written in
Java or C#, and that HotSpot, for example, already incurs extra
overhead to store extents on the string and room for a lock for
synchronisation.  People seem not to care a whole lot about small
constant-factor size increases in exchange for extra functionality.
It's far cheaper to buy another server than to write C.
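
For concreteness, here is a rough sketch of the tagged-index idea
quoted above, written in Python rather than BitC.  Everything in it is
an assumption made for illustration: the names CHAR_INDEX_FLAG,
char_index, byte_index and to_byte_offset are hypothetical, and nothing
like this exists in any BitC library.  It only shows why resolving a
char index against UTF-8 data is a linear scan, and what the size
difference against UCS-4 looks like for one small string.

```python
# Hypothetical tagging scheme: high bit of a 64-bit value set means
# "char (code point) index", clear means "byte index".
CHAR_INDEX_FLAG = 1 << 63
INDEX_MASK = CHAR_INDEX_FLAG - 1   # low 63 bits carry the index value


def char_index(n: int) -> int:
    """Tag n as a code point index."""
    return CHAR_INDEX_FLAG | n


def byte_index(n: int) -> int:
    """Tag n as a byte index (high bit clear)."""
    return n & INDEX_MASK


def to_byte_offset(utf8: bytes, tagged: int) -> int:
    """Resolve a tagged index to a byte offset into UTF-8 data."""
    value = tagged & INDEX_MASK
    if not tagged & CHAR_INDEX_FLAG:
        return value                   # already a byte index: O(1)
    seen = 0
    for offset, b in enumerate(utf8):  # char index: O(n) scan
        if (b & 0xC0) != 0x80:         # not a continuation byte, so a
            if seen == value:          # new code point starts here
                return offset
            seen += 1
    if seen == value:
        return len(utf8)               # one past the end is allowed
    raise IndexError("char index out of range")


text = "na\u00efve caf\u00e9"          # 10 code points, 2 of them non-ASCII
utf8 = text.encode("utf-8")
ucs4 = text.encode("utf-32-le")        # fixed 4 bytes per code point, no BOM

print(len(utf8), len(ucs4))                   # 12 40
print(to_byte_offset(utf8, char_index(3)))    # 4
```

With UCS-4 the same lookup is just a multiplication by four, which is
the O(1) indexing that most VMs expect of their native string type; the
price is the larger footprint shown by the two lengths.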

-- 
William Leslie
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
