>>>
 >>> How can we attribute the performance difference between these xml
 >>> parsers to encoding? Where are the benchmarks?
 >>>
 >>> Memory usage of strings probably isn't as important as you think...
 >>
 >> I think this is incorrect. In UNIX programs circa 1990, 20% of live
 >> in-memory data on workstations was character string data. By 2000, that
 >> number was closer to 60%. The proportion on servers is much higher. So
 >> size of character representation matters both for memory usage reasons
 >> and for cache bandwidth reasons - the latter probably more compelling
 >> than the former.
 >
 >I mean to say that the in-memory format should favour efficiency of
 >iteration and slicing rather than space efficiency. Space efficient
 >representations can be reserved for serialisation. UTF-8 is a
 >fantastic wire format, and it's great on disk, but the space-saving
 >advantages are less important once you are in-memory. If you are
 >dealing with large-enough strings for the trade-off not to be worth
 >it, you can probably hack having to deal with a different
 >representation.
 >

I partly agree, but you could say a typical object domain is 60-70% strings,
yet only a few are actively used, with many held in cache managers etc. By
using UTF-8 you save a lot of memory, and you can
1) Convert to a fixed-width format before working on it, in the rare case
you need to.
2) Use byte indexing.

In effect, for some key cases (XML and HTML) you are skipping serialisation,
since the HTML and XML are directly represented, and you only convert to
fixed width when it's needed (so no loss). E.g. you can read it directly out
of a TCP socket and parse it in place.


 >One of the things I'd like to see BitC class languages used for is
 >writing relational and RDF databases, where the user typically wants
 >to configure collation and the encoding of the on-disk format. For
 >these sort of situations, 60% even seems a bit low. Outside of these
 >special cases, though, strings are probably mostly in the cache or
 >mostly out of cache, and the small conceptual increase in cache size
 >that UTF-8 would provide could be outweighed by other factors.

I agree you won't see much change in cache hit ratios; that said, the amount
of memory work (reads/writes) is almost halved when using byte indexing and
an API like findNext(string, currentByteIndex) and
incrementIndex(currentByteIndex, n). And the less memory work, the less
chance of stalls.

 >
 >>> - for
 >>> large strings, you are probably more interested in using a stream
 >>> decoder than a great big in-memory string, and if that doesn't suit
 >>> your use case, you probably want to implement your own string type,
 >>> whether that be ropes or an array in utf-8 or whatever.
 >>
 >> Possibly, and perhaps, but if so then you aren't concerned about the
 >> native string representation.
 >
 >Exactly: I'm saying that the needs of particular cases like SAX
 >parsing and joins between many-gig tables need not dictate the native
 >string representation, because there are better ways to optimise for
 >such cases.

True, but I think a UTF-8 string type and a tiny char[] subset library, via
a function like GetFixedCharArray, can together handle nearly all cases
(i.e. fast, memory efficient, and backward compatible), meaning more use of
the standard library, which is always good. I do agree there are few UTF-8
parsers out there, owing to 16-bit wide chars being adopted in the early 90s
and the greater difficulty of variable-length chars, but they are fast.

Anyway, the alternatives - UTF-16, UCS-2 (with "pages" or a Java/C#-style
second code unit to support UTF-16), and UCS-4 - are all more distasteful to
me for a new system/language.

Ben 


_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
