On 14 October 2010 09:16, Jonathan S. Shapiro <[email protected]> wrote:
> On Tue, Oct 12, 2010 at 9:47 PM, William Leslie
> <[email protected]> wrote:
>>
>> How can we attribute the performance difference between these xml
>> parsers to encoding? Where are the benchmarks?
>>
>> Memory usage of strings probably isn't as important as you think...
>
> I think this is incorrect. In UNIX programs circa 1990, 20% of live
> in-memory data on workstations was character string data. By 2000, that
> number was closer to 60%. The proportion on servers is much higher. So size
> of character representation matters both for memory usage reasons and for
> cache bandwidth reasons - the latter probably more compelling than the
> former.

I mean to say that the in-memory format should favour efficiency of
iteration and slicing rather than space efficiency. Space-efficient
representations can be reserved for serialisation. UTF-8 is a
fantastic wire format, and it's great on disk, but the space-saving
advantages are less important once you are in memory. If you are
dealing with strings large enough that the trade-off isn't worth it,
you can probably cope with having to use a different representation.

One of the things I'd like to see BitC-class languages used for is
writing relational and RDF databases, where the user typically wants
to configure collation and the encoding of the on-disk format. For
these sorts of situations, 60% even seems a bit low. Outside of these
special cases, though, strings are probably mostly in the cache or
mostly out of cache, and the small effective increase in cache size
that UTF-8 would provide could be outweighed by other factors.

>> - for
>> large strings, you are probably more interested in using a stream
>> decoder then a great big in-memory string, and if that doesn't suit
>> your use case, you probably want to implement your own string type,
>> whether that be ropes or an array in utf-8 or whatever.
>
> Possibly, and perhaps, but if so then you aren't concerned about the native
> string representation.

Exactly: I'm saying that the needs of particular cases like SAX
parsing and joins between many-gig tables need not dictate the native
string representation, because there are better ways to optimise for
such cases.

--
William Leslie
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
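
To make the iteration-and-slicing trade-off concrete, here is a minimal
sketch in Python. It is not from the thread and is not how BitC or any
particular runtime stores strings. The helper names (utf8_offset_of,
utf8_slice, utf32_slice) and the sample text are made up for
illustration, and UTF-32 simply stands in for "some fixed-width
in-memory format". It assumes well-formed UTF-8 input.

```python
def utf8_offset_of(data: bytes, index: int) -> int:
    """Byte offset of the index-th code point in UTF-8 data (linear scan)."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    if seen == index:
        # One past the last code point, so slices can run to the end.
        return len(data)
    raise IndexError(index)


def utf8_slice(data: bytes, start: int, stop: int) -> bytes:
    """Slice by code point position; the work grows with `stop`."""
    return data[utf8_offset_of(data, start):utf8_offset_of(data, stop)]


def utf32_slice(data: bytes, start: int, stop: int) -> bytes:
    """Fixed-width slice; offsets are a multiplication, with no scan."""
    return data[4 * start:4 * stop]


# Hypothetical sample: 15 code points mixing ASCII, Latin-1 and CJK.
text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e"
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")  # little-endian, no byte-order mark

print(len(as_utf8), len(as_utf32))
print(utf8_slice(as_utf8, 6, 10).decode("utf-8"))
print(utf32_slice(as_utf32, 6, 10).decode("utf-32-le"))
```

The UTF-8 buffer is the smaller of the two, which is the space argument,
but locating code point N in it means walking the bytes before it. The
fixed-width buffer pays in size and gets positional slicing for free.
Which cost dominates depends on the workload, as discussed above.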
