>How can we attribute the performance difference between these xml
>parsers to encoding? Where are the benchmarks?
I would love some benchmarks too, but the operations for most things are very simple, and at my last job I did see a C UTF-8 XML parser beat a UCS-2 Java one by 80%. IMHO that is not surprising for benchmarks, since:

- Benchmarks test just string/XML parsing.
- You're working with double the data = more memory, more cache pressure.
- The operations are logically very simple, so it is mostly just memory reading and writing.

Hence the performance for these operations should be close to 100% better. Though it is a fair question how much impact this has on the average program.

>
>Memory usage of strings probably isn't as important as you think - for
>large strings, you are probably more interested in using a stream
>decoder then a great big in-memory string, and if that doesn't suit
>your use case, you probably want to implement your own string type,
>whether that be ropes or an array in utf-8 or whatever.
>

Agree about very large strings, but I don't see many of these. Most objects these days seem to be stuffed with 5-30 character strings, and many parsing tasks like XML and HTML load the entire page/request in one hit and then do thousands of them. All this work would have a much lower memory footprint in UTF-8.

Here is an interesting and relevant thread:
http://mail.nl.linux.org/linux-utf8/2000-08/msg00043.html
It states that for most DBs 70% of the data is string data.

>For typical in-memory string manipulation, UCS-2 has served us well,

I think this is just because UCS-2 was the standard at the time, and it was intended that documents would use it. This didn't happen, and UCS-2 is now a legacy standard, regarded as superseded by UTF-16 (which is variable length) and UTF-32 (UCS-4) for fixed width. The question is why a new system should support a legacy standard built on incorrect assumptions. Many Asian characters (those outside the Basic Multilingual Plane) can't be represented in UCS-2, making it probably worse than the old ASCII-compatible encodings still in common use in Asia.
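To make the size and representability points above concrete, here is a small Python sketch that compares encoded sizes and checks whether a string fits in UCS-2. The sample strings and the helper name `encoded_sizes` are made up for illustration; this is not a benchmark and says nothing about parse speed.

```python
# Hypothetical sample strings, chosen only to illustrate the size trade-off.
samples = {
    "ascii_tag": "<customer id='42'>",
    "latin": "Müller",
    "cjk_bmp": "日本語",
    "cjk_non_bmp": "\U00020000",  # U+20000 (CJK Extension B), outside the BMP
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes, fits_in_ucs2) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no 2-byte byte-order mark is added.
    utf16_bytes = len(text.encode("utf-16-le"))
    # UCS-2 can only hold code points up to U+FFFF (no surrogate pairs).
    fits_in_ucs2 = all(ord(ch) <= 0xFFFF for ch in text)
    return utf8_bytes, utf16_bytes, fits_in_ucs2


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

ASCII-heavy data such as markup comes out at half the size in UTF-8, which is the case argued for here. Text made of BMP CJK characters goes the other way (3 bytes per character in UTF-8 against 2 in UTF-16), and the last sample cannot be stored in UCS-2 at all.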
IMHO just because something was used in the past doesn't make it good for the future; by this reasoning all new OSes would be based on Win32 :-) Note that using UTF-8 is a left-field suggestion, but it is important to question the conditions under which the choice was made at the time, and it MAY give BitC quite a boost in terms of performance and memory, especially for embedded systems. I would also say the old ASCII-compatible encodings like Big5 have also served us well and are still common despite UCS-2 OSes.

>and people usually work under the assumption that indexing or slicing
>a string by index-of-codepoint is O(1) (even if the strings resulting
>from the slice may not be valid). I think it is a useful assumption,
>and that programmers will continue to want cheap slices based on a
>vague if sometimes incorrect count of characters for the time being.

Correct, people have this assumption, which is why exposing an underlying string based on UTF-16 or UTF-8 is bad, and why a string type that supports codepoint indexing as standard would rule out these encodings. A lib can hide this and keep O(1) for most ops by using byte-based indexing. So I would see this working as:

- The lib uses byte index offsets and provides most API calls in a form that does not use indexes, e.g. Replace etc.
- 95% of devs use the lib functions.
- Developers can call GetCharArray to return a copy of the internal UTF-8 array for mutable work.
- Developers can call GetFixedCharArray to return a UCS-2 array for mutable fixed-width char work and legacy support.

The other option for the API is like Perl, where the string is stored in UTF-8 but this is hidden from the user, so an index operation would be O(n).

>
>As for immutability, I don't see what that has to do with indexing or
>slicing or encoding. Immutable strings are non-optional in any sane
>modern language.

Agree they should be immutable; I was just covering the option of using indexes to change the string, which C allows.
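As a rough sketch of the library shape described above (Python, purely illustrative, not BitC code and not an existing API): the class name `Utf8String` and its methods are hypothetical, with `get_char_array` and `get_fixed_char_array` loosely standing in for the GetCharArray and GetFixedCharArray calls mentioned. Note that Python copies the bytes on each slice, so this only shows the interface, not the O(1) cost a real implementation could aim for.

```python
from array import array


class Utf8String:
    """Hypothetical immutable string stored internally as UTF-8 bytes."""

    def __init__(self, data):
        data = bytes(data)
        data.decode("utf-8")  # validate; raises UnicodeDecodeError if invalid
        self._data = data

    @classmethod
    def from_text(cls, text):
        return cls(text.encode("utf-8"))

    def __str__(self):
        return self._data.decode("utf-8")

    def find(self, needle):
        """Return the byte offset of needle, or -1 if it is absent."""
        return self._data.find(needle._data)

    def slice(self, start, end):
        """Slice by byte offsets; offsets inside a character are rejected."""
        return Utf8String(self._data[start:end])

    def replace(self, old, new):
        """Index-free operation: callers never see offsets at all."""
        if not old._data:
            raise ValueError("cannot replace an empty string")
        return Utf8String(self._data.replace(old._data, new._data))

    def get_char_array(self):
        """Return a mutable copy of the internal UTF-8 bytes."""
        return bytearray(self._data)

    def get_fixed_char_array(self):
        """Return a mutable array of 16-bit units (UCS-2); BMP text only."""
        text = self._data.decode("utf-8")
        if any(ord(ch) > 0xFFFF for ch in text):
            raise ValueError("text contains characters outside UCS-2")
        return array("H", (ord(ch) for ch in text))


s = Utf8String.from_text("naïve café")
offset = s.find(Utf8String.from_text("café"))
print(offset)                      # byte offset 7, not character index 6
print(s.slice(offset, offset + 5))
print(s.replace(Utf8String.from_text("café"), Utf8String.from_text("bar")))
```

Offsets returned by `find` can be passed straight back to `slice`, so typical search-then-cut code never needs a codepoint index. Byte-level search is safe here because a valid UTF-8 needle cannot match in the middle of a character.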
(And note converting to and from an array does allow this for legacy code.)

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
