On Thu, Aug 29, 2013 at 3:25 AM, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Aug 28, 2013 at 6:25 AM, Bennie Kloosteman <[email protected]> wrote:
>> ...The fact that 90% of strings are 0x00 0x?? 0x00 0x?? etc. seems
>> monumentally wasteful even for foreign languages...
>
> That's an amazingly western-centric view, and it's flatly contradicted by
> actual data.

I live in China, and I posted the data before: look how many JavaScript, PostScript, and HTML keywords are in pages, and even XML and JSON type names, messages, etc. I pulled down 20 foreign web sites and they were nearly all over 80% ASCII, because of the influence of English on software.

Re the western view: do you know most Chinese strings are 1/3 the size because they make use of the full 16 bits anyway? And Unicode is not even official here; officially you should use ASCII and then an encoding scheme, GB or GBK. (Unicode can't do the newer characters, so they do this encoding on top of Unicode anyway and suffer a double whammy, because the encoded characters are wider.)

> I'm in favor of UTF8 strings, and also of "chunky" strings in which
> sub-runs are encoded using the most efficient encoding for the run. Those
> are a lot harder to implement correctly than you might believe.

I know; we discussed most of these issues earlier in BitC. I even had a look at converting Mono to UTF-8 in the string implementation, but there were too many native and unsafe hooks; it made it too hard.

> The problem with UTF8 strings is that they do not index efficiently. s[i]
> becomes an O(log n) operation rather than an O(1) operation. For sequential
> access you can fix that with an iteration helper class, but not all access
> is sequential. The same problem exists for strings having mixed formats.

If you know it's ASCII it indexes fast. Most strings are very small, and SIMD can scan 32 bytes at a time for the high bit (0x80), so you can quickly build an offset index. I bet nearly all indexing is on English characters anyway. You may say long strings, but nearly all long strings are UTF-8!
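The ASCII fast path can be sketched in Rust (which the thread cites for its UTF-8 strings). This is a minimal scalar sketch, not a real SIMD implementation: `all_ascii` and `char_at` are hypothetical helpers, and a production string type would vectorize the high-bit test and cache the result as a flag rather than rescanning.

```rust
// Scalar form of the high-bit scan: any byte >= 0x80 means non-ASCII.
// A SIMD version would test 16/32 bytes per instruction; a string type
// would compute this once and store an "is ASCII" bit.
fn all_ascii(s: &str) -> bool {
    s.bytes().all(|b| b < 0x80)
}

// If the string is pure ASCII, s[i] is a plain O(1) byte access;
// otherwise fall back to an O(n) walk over the UTF-8 sequences.
fn char_at(s: &str, i: usize) -> Option<char> {
    if all_ascii(s) {
        s.as_bytes().get(i).map(|&b| b as char)
    } else {
        s.chars().nth(i)
    }
}

fn main() {
    assert_eq!(char_at("hello", 1), Some('e'));   // O(1) path
    assert_eq!(char_at("héllo", 1), Some('é'));   // O(n) fallback
    println!("ok");
}
```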
>> Pretty much 60% of the data moved around or compared for most string
>> operations is a huge win over C# and Java. Most web sites are UTF8-ASCII
>> and even foreign web sites are 80-90% ASCII.
>> Think middle tier performance: json, xml etc. Maybe enough to lift
>> mono over those products.
>
> The proportion of in-heap string data has grown since I last saw
> comprehensive measurements, and for applications like DOM trees it's a big
> part of the total live working set. But data copies are *not* the
> dominant issue in performance in such applications. Data indexing is. This
> is why IBM's ICU library is so important. It reconciles all of the
> conflicting definitions of indexing methods and implements the classes that
> make the reconciliation possible.

1. Reducing heap size by 35% does affect performance: if the app is paging, it helps a lot, and you also improve cache performance. On handhelds the memory saving allows better algorithms to be used for the rest of the app.

2. You can't index Asian characters on Unicode anyway, as there is not a 1:1 Unicode-to-character relationship (see the encoding above); they nearly always use additional libraries. So by using UTF-16 you hurt most Asian languages, since they build custom encodings on top of Unicode, and you hurt Western European performance, including the English characters embedded in everyone's language. You benefit letter-based (not character-based) non-European languages, provided the English content is not too high.

3. If you're doing really intensive mutable indexing, you're not working with C# immutable strings; you're likely working with char arrays and tree-based representations. Re DOM trees: the Rust guys know a lot about DOM trees, since they are the Firefox team, and they use UTF-8 for strings in Rust (and UCS-4 for chars).

4. To do the indexing, in most cases you're doing an O(n) scan anyway, so there is no difference: e.g. find the first "<node>", then find the next "</node>", then subtract; that is identical between UTF-8 and UTF-16.
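The scan-and-subtract argument in point 4 can be sketched as follows. `inner_text` is a hypothetical helper, not anything from the Mono or Rust libraries; the point is that searching for an ASCII delimiter is a byte-level O(n) scan regardless of the payload encoding, and UTF-8's design (no byte of a multi-byte sequence is ever below 0x80) makes the naive byte search safe even over non-ASCII content.

```rust
// Find the text between the first "<node>" and the next "</node>".
// str::find returns byte offsets, so this is the same O(n) scan and
// subtract whether the content in between is ASCII or multi-byte UTF-8.
fn inner_text(doc: &str) -> Option<&str> {
    let start = doc.find("<node>")? + "<node>".len();
    let end = doc[start..].find("</node>")? + start;
    Some(&doc[start..end])
}

fn main() {
    // Works unchanged when the payload is multi-byte Chinese text,
    // because the ASCII delimiter bytes cannot occur mid-character.
    assert_eq!(inner_text("<node>中文</node>"), Some("中文"));
    assert_eq!(inner_text("no tags here"), None);
    println!("ok");
}
```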
Performance is only significantly different if you go and get, say, the 1000th character after the index of the node. Also, the string can carry a bit, set after a complete scan (or at construction when the content is known), indicating it is ASCII, which eliminates the escape check.

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
