>How can we attribute the performance difference between these xml
 >parsers to encoding? Where are the benchmarks?

I would love some benchmarks too, but the operations involved are mostly
very simple, and in my last job I did see a C UTF-8 XML parser beat a UCS-2
Java one by 80%. IMHO that is not surprising for benchmarks, since:
- Benchmarks test just string/XML parsing.
- You're working with double the data = more memory, more cache pressure.
- The operations are logically very simple, so it is mostly just memory
reading and writing.

Hence the performance for these operations should be close to 100% better.
You are right to question how much impact this has on the average program,
though.

>
 >Memory usage of strings probably isn't as important as you think - for
 >large strings, you are probably more interested in using a stream
 >decoder then a great big in-memory string, and if that doesn't suit
 >your use case, you probably want to implement your own string type,
 >whether that be ropes or an array in utf-8 or whatever.
 >

Agreed about very large strings, but I don't see many of those. Most objects
these days seem to be stuffed with 5-30 character strings, and many parsers,
such as XML and HTML ones, load the entire page/request in one hit and then
do thousands of them. All this work would have a much lower memory footprint
in UTF-8.
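
To make the footprint point concrete, here is a rough sketch in Python. The
sample strings are made up for illustration, and it only measures the encoded
payload, not any per-object overhead a real runtime would add.

```python
# Hypothetical sample of short field values, typical of object or XML data.
samples = ["customerId", "2009-03-14", "Ben", "status=active", "<item>"]


def encoded_size(strings, encoding):
    """Total number of bytes needed to store the strings in the given encoding."""
    return sum(len(s.encode(encoding)) for s in strings)


utf8_bytes = encoded_size(samples, "utf-8")
# "utf-16-le" writes no byte order mark; for text inside the BMP this is
# the same size as UCS-2.
utf16_bytes = encoded_size(samples, "utf-16-le")

print("UTF-8 bytes: ", utf8_bytes)
print("UTF-16 bytes:", utf16_bytes)
```

For pure ASCII data like this the 16-bit form is exactly twice the size. The
ratio is different for other scripts: CJK text takes 3 bytes per character in
UTF-8 against 2 in UTF-16, so the saving depends on the data.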




Here is an interesting and relevant thread:
http://mail.nl.linux.org/linux-utf8/2000-08/msg00043.html It states that for
most DBs 70% of the data is string data.

>For typical in-memory string manipulation, UCS-2 has served us well,

I think this is just because UCS-2 was the standard at the time and it was
intended that documents would use it. This didn't happen, and UCS-2 is now a
legacy standard, regarded as superseded by UTF-16 (which is variable
length) and UTF-32 for fixed width. The question is why a new system should
support a legacy standard built on incorrect assumptions.
Characters outside the Basic Multilingual Plane, which include many Asian
characters, can't be represented in UCS-2, making it probably worse than
the legacy multi-byte encodings still in common use in Asia.
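
A small Python sketch of the limitation, using U+20000 (a CJK Extension B
ideograph) as the example character. The helper name fits_in_ucs2 is made up
for this illustration.

```python
# U+20000 is a CJK ideograph outside the Basic Multilingual Plane.
ch = "\U00020000"


def fits_in_ucs2(char):
    """True if the character fits in one 16-bit UCS-2 code unit."""
    return ord(char) <= 0xFFFF


utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-le")   # little-endian, no byte order mark
utf32 = ch.encode("utf-32-le")

utf16_units = len(utf16) // 2    # number of 16-bit code units

print("fits in UCS-2:", fits_in_ucs2(ch))
print("UTF-8 bytes:", len(utf8))
print("UTF-16 code units:", utf16_units)
print("UTF-32 bytes:", len(utf32))
```

UTF-16 needs a surrogate pair (two code units) for this character, which is
why it is variable length, while UCS-2 has no way to express it at all.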

IMHO just because something was used in the past doesn't make it good for
the future; by that reasoning all new OSes would be based on Win32 :-) Note
that using UTF-8 is a left-field suggestion, but it is important to question
the conditions under which the choice was made at the time, and it MAY give
BitC quite a boost in terms of performance and memory, especially for
embedded systems.

I would also say the legacy encodings like Big5 have served us well too,
and are still common despite UCS-2 OSes.

>and people usually work under the assumption that indexing or slicing
 >a string by index-of-codepoint is O(1) (even if the strings resulting
 >from the slice may not be valid). I think it is a useful assumption,
 >and that programmers will continue to want cheap slices based on a
 >vague if sometimes incorrect count of characters for the time being.

Correct, people have this assumption, which is why it clashes with an
underlying string based on UTF-16 or UTF-8; a string type that supports
codepoint indexing as standard would rule these encodings out. A lib can
hide this and keep most ops O(1) by using byte-based indexing.

So I would see this working as follows:
- The lib uses byte index offsets and provides most API calls in a form that
does not take indexes, e.g. Replace.
- 95% of devs use the lib functions.
- Developers can call GetCharArray to get a copy of the internal UTF-8
array for mutable work.
- Developers can call GetFixedCharArray to get a UCS-2 array for mutable
fixed-width char work and legacy support.
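
The list above could be sketched roughly as follows. This is Python purely
for illustration, not a proposal for how BitC would spell it. Apart from
GetCharArray and GetFixedCharArray (written here as get_char_array and
get_fixed_char_array), every name is hypothetical, and rejecting offsets that
land inside a character is my assumption rather than a settled design.

```python
from array import array


class Utf8String:
    """Immutable string stored as UTF-8 and addressed by byte offsets."""

    def __init__(self, text=""):
        self._data = text.encode("utf-8")

    @classmethod
    def _from_bytes(cls, data):
        obj = cls()
        obj._data = data
        return obj

    def __str__(self):
        return self._data.decode("utf-8")

    def byte_length(self):
        return len(self._data)

    def _is_char_boundary(self, offset):
        # UTF-8 continuation bytes always look like 10xxxxxx.
        if offset == len(self._data):
            return True
        return (self._data[offset] & 0xC0) != 0x80

    def find(self, needle, start=0):
        """Return the byte offset of needle at or after start, or -1."""
        return self._data.find(needle.encode("utf-8"), start)

    def replace(self, old, new):
        """Index-free operation; returns a new string."""
        replaced = self._data.replace(old.encode("utf-8"), new.encode("utf-8"))
        return Utf8String._from_bytes(replaced)

    def slice_bytes(self, start, end):
        """Slice by byte offsets; no scan is needed to locate the ends."""
        if not (0 <= start <= end <= len(self._data)):
            raise IndexError("byte offsets out of range")
        if not (self._is_char_boundary(start) and self._is_char_boundary(end)):
            raise ValueError("byte offset falls inside a multi-byte character")
        return Utf8String._from_bytes(self._data[start:end])

    def get_char_array(self):
        """Mutable copy of the internal UTF-8 bytes."""
        return bytearray(self._data)

    def get_fixed_char_array(self):
        """Mutable array of 16-bit UCS-2 code units, for legacy code."""
        code_points = [ord(c) for c in str(self)]
        if any(cp > 0xFFFF for cp in code_points):
            raise ValueError("string contains characters outside UCS-2")
        return array("H", code_points)


s = Utf8String("héllo wörld")
offset = s.find("wörld")                      # a byte offset, not a char count
print(s.slice_bytes(offset, s.byte_length()))
print(s.replace("wörld", "there"))
```

Offsets returned by find can be passed straight back to slice_bytes, so most
code never needs to count characters.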

The other option for the API is to do it like Perl, where the string is
stored in UTF-8 but that fact is hidden from the user, so indexing by
character would be O(n).
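
To show where the O(n) comes from, here is a minimal Python sketch of mapping
a character index to a byte offset over UTF-8 data. The function name is made
up, and this is only the naive scan, not a description of what Perl actually
does internally.

```python
def byte_offset_of_codepoint(data, index):
    """Scan UTF-8 bytes to find where the index-th character starts."""
    count = 0
    for offset, byte in enumerate(data):
        # Any byte that is not a continuation byte (10xxxxxx) starts a character.
        if (byte & 0xC0) != 0x80:
            if count == index:
                return offset
            count += 1
    raise IndexError("character index out of range")


data = "aé日b".encode("utf-8")
print(byte_offset_of_codepoint(data, 3))
```

The scan has to walk every byte before the target, which is the cost a
byte-offset API avoids.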


 >
 >As for immutability, I don't see what that has to do with indexing or
 >slicing or encoding. Immutable strings are non-optional in any sane
 >modern language.

Agreed, they should be immutable. I was just covering the option of using
indexes to change the string, which C allows. (And note that converting to
and from an array does allow this for legacy code.)

Ben 

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
