On 14 October 2010 09:16, Jonathan S. Shapiro <[email protected]> wrote:
> On Tue, Oct 12, 2010 at 9:47 PM, William Leslie
> <[email protected]> wrote:
>>
>> How can we attribute the performance difference between these xml
>> parsers to encoding? Where are the benchmarks?
>>
>> Memory usage of strings probably isn't as important as you think...
>
> I think this is incorrect. In UNIX programs circa 1990, 20% of live
> in-memory data on workstations was character string data. By 2000, that
> number was closer to 60%. The proportion on servers is much higher. So size
> of character representation matters both for memory usage reasons and for
> cache bandwidth reasons - the latter probably more compelling than the
> former.

What I mean is that the in-memory format should favour efficient
iteration and slicing over space efficiency. Space-efficient
representations can be reserved for serialisation. UTF-8 is a
fantastic wire format, and it's great on disk, but its space-saving
advantages matter less once you are in memory. If you are dealing
with strings large enough that the trade-off isn't worth it, you can
probably cope with using a different representation.
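
To make the trade-off concrete, here is a minimal Python sketch. The
function names are hypothetical and this is not how BitC or any
particular runtime stores strings. It only illustrates that slicing by
code point in UTF-8 needs a scan to find the boundaries, whereas a
fixed-width encoding (UTF-32 here, as an assumption) needs only offset
arithmetic, at the cost of more bytes.

```python
def utf8_slice(buf: bytes, start: int, stop: int) -> bytes:
    """Slice UTF-8 bytes by code point index. Needs an O(n) scan."""
    # A code point begins at every byte that is not a continuation
    # byte (continuation bytes look like 0b10xxxxxx).
    starts = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    starts.append(len(buf))          # sentinel: end of the buffer
    n = len(starts) - 1              # number of code points
    start, stop = min(start, n), min(stop, n)
    return buf[starts[start]:starts[stop]]


def utf32_slice(buf: bytes, start: int, stop: int) -> bytes:
    """Slice UTF-32 bytes by code point index. Offset arithmetic only."""
    return buf[4 * start:4 * stop]


text = "na\u00efve caf\u00e9"        # 10 code points, 2 of them non-ASCII
u8 = text.encode("utf-8")            # variable width
u32 = text.encode("utf-32-le")       # fixed width, no byte order mark

print(len(u8), len(u32))
print(utf8_slice(u8, 2, 5).decode("utf-8"))
print(utf32_slice(u32, 2, 5).decode("utf-32-le"))
```

Both slices give the same text; the difference is in how much work it
takes to find the boundaries and how many bytes the string occupies.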

One of the things I'd like to see BitC-class languages used for is
writing relational and RDF databases, where the user typically wants
to configure collation and the encoding of the on-disk format. For
these sorts of situations, even 60% seems a bit low. Outside of these
special cases, though, strings are probably either mostly in cache or
mostly out of cache, and the small effective increase in cache size
that UTF-8 would provide could be outweighed by other factors.

>> - for
>> large strings, you are probably more interested in using a stream
>> decoder than a great big in-memory string, and if that doesn't suit
>> your use case, you probably want to implement your own string type,
>> whether that be ropes or an array in utf-8 or whatever.
>
> Possibly, and perhaps, but if so then you aren't concerned about the native
> string representation.

Exactly: I'm saying that the needs of particular cases like SAX
parsing and joins between many-gig tables need not dictate the native
string representation, because there are better ways to optimise for
such cases.
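
As a rough illustration of the stream-decoder approach, here is a
small Python sketch using the standard library's incremental decoder.
The `decode_stream` helper and the chunk size are assumptions made for
the example. The point is that the input can be decoded piece by piece,
even when a multi-byte sequence is split across chunk boundaries,
without ever building one big in-memory string.

```python
import codecs


def decode_stream(chunks):
    """Yield decoded text from an iterable of UTF-8 byte chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # The decoder buffers any incomplete trailing sequence and
        # finishes it when the next chunk arrives.
        piece = decoder.decode(chunk)
        if piece:
            yield piece
    # Flush; raises UnicodeDecodeError if the input ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "caf\u00e9 \u20ac5".encode("utf-8")
# Chunks of 4 bytes deliberately split both multi-byte characters.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

for piece in decode_stream(chunks):
    print(repr(piece))
```

A consumer such as a SAX-style parser would process each piece as it
arrives rather than joining them back together.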

-- 
William Leslie
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
