We have a hybrid representation that converts content lazily as needed. The 
code that's currently checked in is a first implementation I wrote in a day 
before RailsConf, so it is pretty basic, is not thoroughly tested, and has a 
bunch of bugs I already know about. I'm working on some improvements right now.

Here's the checkin comment that explains briefly how it works. Note that some 
details are subject to change:

A new implementation for Ruby MutableString and Ruby regular expression 
wrappers.
This is just the first pass, without optimizations and without encodings (the 
default system encoding is used for all strings).
Many improvements and adjustments will come in the future, and some hacks will 
be removed.

Basic architecture:
MutableString holds a Content and an Encoding. Content is an abstract class 
that has three subclasses:
1)      StringContent
-       Holds an instance of System.String, an immutable .NET string. This 
is the default representation for strings coming from CLR methods and for Ruby 
string literals.
-       A textual write operation on a mutable string that has this content 
representation implicitly converts the representation to 
StringBuilderContent.
-       A binary read/write operation triggers a transition to BinaryContent 
using the Encoding stored on the owning MutableString.

2)      StringBuilderContent
-       Holds an instance of System.Text.StringBuilder, a mutable Unicode 
string.
-       A binary read/write operation transforms the content to the 
BinaryContent representation.
-       StringBuilder is not optimal for some operations (it requires 
unnecessary copying); we may consider replacing it with a resizable char[].

3)      BinaryContent
-       A textual read/write operation transforms the content to the 
StringBuilderContent representation.
-       List<byte> is currently used, but it doesn't fit many operations very 
well. We should replace it with a resizable byte[].
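
To make the transitions above concrete, here is a minimal sketch in Python 
(not the actual IronRuby code, which is C#). The class name is reused only for 
readability; the method names, the "kind" tags, and the UTF-8 default are 
assumptions made for illustration. A list of characters stands in for 
StringBuilder and a bytearray for the byte buffer.

```python
class MutableString:
    """Toy model of the hybrid representation; all names are illustrative."""

    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding   # remembered for later conversions
        self.kind = "string"       # "string" | "builder" | "binary"
        self.content = text        # immutable str, standing in for System.String

    def _to_text(self):
        # Binary -> mutable text: decode with the stored encoding.
        if self.kind == "binary":
            self.content = list(self.content.decode(self.encoding))
            self.kind = "builder"

    def _to_builder(self):
        # Immutable string (or binary) -> mutable text.
        self._to_text()
        if self.kind == "string":
            self.content = list(self.content)
            self.kind = "builder"

    def _to_binary(self):
        # Either textual form -> mutable bytes, using the stored encoding.
        if self.kind != "binary":
            text = "".join(self.content)
            self.content = bytearray(text.encode(self.encoding))
            self.kind = "binary"

    def append_text(self, s):      # textual write
        self._to_builder()
        self.content.extend(s)

    def append_byte(self, b):      # binary write
        self._to_binary()
        self.content.append(b)

    def char_at(self, i):          # textual read, no conversion from "string"
        self._to_text()
        return self.content[i]

    def byte_at(self, i):          # binary read
        self._to_binary()
        return self.content[i]


s = MutableString("abc")
s.char_at(0)        # textual read: stays in the immutable "string" form
s.append_text("d")  # textual write: switches to "builder"
s.byte_at(0)        # binary read: switches to "binary"
```

Each conversion happens only when an operation of the other kind is actually 
requested, which is the lazy behavior described above.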

The content representation is changed based upon the operations that are 
performed on the mutable string. There is currently no limit on the number of 
content type switches, so if one alternates binary and textual operations, a 
conversion takes place for each one of them. Although this shouldn't be a 
common case, we may consider adding some counters and keeping the 
representation binary or textual based upon their values.
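
As a rough illustration of the alternation cost, the following Python sketch 
counts representation switches. It is a toy model, not the IronRuby 
implementation: the class, its fields, and the idea of exposing a "switches" 
counter are assumptions, and no policy for acting on the counter is shown 
because none has been decided.

```python
class SwitchCountingString:
    """Toy two-state string that counts text/binary conversions."""

    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding
        self.is_binary = False
        self.data = list(text)   # textual form: list of characters
        self.switches = 0        # hypothetical counter

    def _ensure(self, binary):
        # Convert only when the requested form differs from the current one.
        if self.is_binary == binary:
            return
        if binary:
            self.data = bytearray("".join(self.data).encode(self.encoding))
        else:
            self.data = list(self.data.decode(self.encoding))
        self.is_binary = binary
        self.switches += 1

    def append_text(self, s):
        self._ensure(False)
        self.data.extend(s)

    def append_byte(self, b):
        self._ensure(True)
        self.data.append(b)


s = SwitchCountingString("")
for _ in range(3):
    s.append_text("a")   # textual write
    s.append_byte(0x62)  # binary write, forces a conversion
print(s.switches)        # 5: every operation after the first one converts
```

A counter like this is the kind of data a future heuristic could use to pin 
the representation instead of converting on every operation.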

The design assumes that operations implemented by library methods are of two 
kinds, textual and binary, and that data once treated as text are not usually 
treated as raw binary data later. Any text in the IronRuby runtime is 
represented as a sequence of 16-bit Unicode characters (the standard .NET 
representation). Any binary data treated as text is converted to this 
representation, regardless of the encoding used for its storage representation 
in the file. The encoding is remembered in the MutableString instance, so the 
original representation can always be recreated. Not all Unicode characters 
fit into 16 bits, so some exotic ones are represented by multiple 
characters (surrogates). If there is such a character in the string, some 
operations (e.g. indexing) might not be precise anymore: the n-th item in the 
char[] isn't the n-th Unicode character in the string. We believe this 
imprecision is not a real-world issue and is worth the performance gain and 
implementation simplicity.
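
The surrogate effect can be seen with a small Python sketch. Python strings 
index by code point, so the helper below (a hypothetical name) expands a 
string into 16-bit UTF-16 code units to mimic what a .NET char[] would hold. 
The sample text and the ISO-8859-2 round trip are illustrative choices only.

```python
import struct


def utf16_units(text):
    """Return the 16-bit UTF-16 code units of text, like a .NET char[]."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


text = "a\U0001D11Eb"      # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
units = utf16_units(text)

print(len(text))           # 3 Unicode characters
print(len(units))          # 4 code units: the clef needs a surrogate pair
print(hex(units[2]))       # 0xdd1e, a low surrogate, not 'b'

# Valid bytes decoded with the remembered encoding can be re-encoded
# to recover the original storage representation.
raw = "\u017elu\u0165".encode("iso-8859-2")
assert raw.decode("iso-8859-2").encode("iso-8859-2") == raw
```

Index 2 of the code-unit array lands in the middle of the surrogate pair, 
which is the imprecision described above.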

Tomas

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Charles Oliver 
Nutter
Sent: Thursday, August 07, 2008 3:18 PM
To: [email protected]
Subject: [Ironruby-core] Bytes or Characters?

Hey, I'm curious how IronRuby is handling the bytes versus characters
issue for Ruby strings. JRuby currently only has byte[]-based strings, a
decision we made mostly for Ruby performance. But it has obvious
implications for calling Java code, since we need to decode and encode
the byte[] to char[] on the way in and out. Ultimately the decision to
use byte[]-based strings was the right one, since so much of Ruby
expects byte counts and uses String as a generic byte bucket. But more
and more we've started to consider ways to hybridize String so it's
characters when we want it to be and bytes otherwise.

So, what does IronRuby do?

- Charlie
_______________________________________________
Ironruby-core mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ironruby-core
