Re: [Ironruby-core] Code Review: MutableString5

Tomas Matousek Sun, 11 May 2008 10:12:38 -0700

I thought about that, but given that there are about 15 overloads for Append, it 
might be unnecessary code duplication to add them for constructors as well.
You can do it on a single line too:


MutableString str = MutableString.CreateBinary(received).Append(buffer, 0, received);

Append returns the MutableString instance back and you can also specify 
estimated capacity to CreateBinary if you know it.

Let's use this for now, and if the pattern turns out to be very common let's 
consider adding more overloads.

Tomas

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Bacon 
Darwin
Sent: Sunday, May 11, 2008 5:23 AM
To: ironruby-core@rubyforge.org
Subject: Re: [Ironruby-core] Code Review: MutableString5

One thing that MutableString could do with is
        public static MutableString/*!*/ CreateBinary(byte[]/*!*/ bytes, int start, int length) {
At the moment you have to do something like:
        MutableString str = MutableString.CreateBinary();
        str.Append(buffer, 0, received);
Pete

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tomas Matousek
Sent: Saturday, May 10, 2008 22:42
To: ironruby-core@rubyforge.org
Subject: Re: [Ironruby-core] Code Review: MutableString5

$KCODE is orthogonal to the encoding in MutableString. $KCODE seems to be just 
a value that is used by some library methods that perform binary operations on 
textual data. MutableString.Encoding is the encoding of the representation. If 
a MutableString instance is created from a .NET string, the encoding associated 
with it is used whenever the string is consumed by a binary data operation. We 
could represent all strings as byte[], but then you'd need to convert .NET 
strings to byte[] at construction time. MutableString allows you to be lazy and 
perhaps not perform the conversion at all if it is not needed.

Could you give a code sample that you think could be broken?

Tomas

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Bacon 
Darwin
Sent: Saturday, May 10, 2008 2:27 AM
To: ironruby-core@rubyforge.org
Subject: Re: [Ironruby-core] Code Review: MutableString5

This is a big old diff to search through.  I couldn't work out a way of easily 
patching it onto my source at home due to the folder differences.
I really like this hybrid idea and it looks like it will work well.  I have one 
question with regard to encodings and KCODE.
I appreciate that String is changing between Ruby 1.8 and 1.9.  It appears that 
this MutableString implementation is leaning more toward the 1.9 implementation 
(i.e. holding on to an Encoding within the String itself).

1.8 does not hold the encoding in the String and, as I understand it, the 
implicit encoding of the bytes held in a String is driven off KCODE.  Is that 
correct?  If so, there are a number of scenarios which I think could cause 
problems with MutableString holding on to its own Encoding, and which stem from 
times when KCODE is changed at runtime.  I'll try to describe a concrete 
example and you can tell me where I am going wrong...

Assume that KCODE is set to UTF8.  If you create a String from an array of 
bytes in Ruby, the bytes are just stored as-is.  You can do stuff which is 
encoding-dependent and UTF8 is assumed.
If you now change KCODE to, say, EUC, then the bytes in the String are unchanged 
but encoding-dependent operations will now possibly produce different results 
on the same string, since they interpret the bytes differently.
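The scenario above can be sketched in a few lines of Ruby. This is only an 
illustration under some assumptions: it runs on Ruby 1.9 or later and uses 
String#force_encoding as a stand-in for changing $KCODE in 1.8, and the names 
RAW_BYTES, char_count and bytes_after_relabel are made up for the example. The 
byte values were picked so that they are structurally valid in both UTF-8 and 
EUC-JP.

```ruby
# Six bytes that are structurally valid in both UTF-8 and EUC-JP.
RAW_BYTES = [0xE3, 0xA1, 0xA1, 0xE3, 0xA1, 0xA1].freeze

# Counts characters when the same bytes are interpreted under `encoding`.
# force_encoding only relabels the string; it never touches the bytes.
def char_count(bytes, encoding)
  bytes.pack("C*").force_encoding(encoding).length
end

# Returns the bytes of the string after relabelling it with `encoding`.
def bytes_after_relabel(bytes, encoding)
  bytes.pack("C*").force_encoding(encoding).bytes
end

puts char_count(RAW_BYTES, Encoding::UTF_8)   # read as two 3-byte sequences
puts char_count(RAW_BYTES, Encoding::EUC_JP)  # read as three 2-byte sequences
puts bytes_after_relabel(RAW_BYTES, Encoding::EUC_JP) == RAW_BYTES
```

The stored bytes are identical in both cases; only the answer to an 
encoding-dependent question (here, the character count) changes.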
The worry I have with MutableString is what happens if you create a string from 
bytes, then do an operation that requires it to be converted to a CLR string 
internally, and then change KCODE.  You can't simply change the Encoding value 
of the MutableString, since if you then access the bytes you will not get back 
the same bytes that were originally put in.  I suppose, on changing KCODE, you 
could go through all the strings in memory that have been converted from binary 
to CLR strings and convert them (i.e. back to bytes via the old encoding and 
then to CLR strings via the new encoding).  What would be the optimal solution 
in this case?
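A rough Ruby sketch of the round-trip problem, again assuming Ruby 1.9 or later. 
A UTF-16LE string stands in for the CLR string, and ORIGINAL_BYTES, 
to_utf16_text and to_bytes are hypothetical names used only here, not anything 
from MutableString.

```ruby
# UTF-8 bytes for the single character U+3042 (hiragana "a").
ORIGINAL_BYTES = [0xE3, 0x81, 0x82].freeze

# Decodes bytes to UTF-16 text (standing in for a CLR string) via `encoding`.
def to_utf16_text(bytes, encoding)
  bytes.pack("C*").force_encoding(encoding).encode(Encoding::UTF_16LE)
end

# Encodes UTF-16 text back to bytes via `encoding`.
def to_bytes(text, encoding)
  text.encode(encoding).bytes
end

text = to_utf16_text(ORIGINAL_BYTES, Encoding::UTF_8)

# Same encoding on the way back: the original bytes are recovered.
p to_bytes(text, Encoding::UTF_8) == ORIGINAL_BYTES

# Only the encoding label changed in between: different bytes come out.
p to_bytes(text, Encoding::EUC_JP) == ORIGINAL_BYTES
```

This is why simply swapping the Encoding value after the conversion to text has 
happened would not give back the bytes that were originally put in.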

Again, I am not talking from a position of deep knowledge here so I may be 
missing something really obvious.  But I thought it was worth asking the 
question.

Regards,

Pete



From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tomas Matousek
Sent: Friday, May 09, 2008 19:08
To: IronRuby External Code Reviewers
Cc: ironruby-core@rubyforge.org
Subject: [Ironruby-core] Code Review: MutableString5


tfpt review /shelveset:MutableString5;REDMOND\tomat


A new implementation for Ruby MutableString and Ruby regular expression 
wrappers.
This is just the first pass, w/o optimizations and w/o encodings (the default 
system encoding is used for all strings).
Many improvements and adjustments will come in the future, and some hacks will 
be removed.

Basic architecture:
MutableString holds on to Content and Encoding. Content is an abstract class 
that has three subclasses:

1) StringContent
   - Holds an instance of System.String, an immutable .NET string. This is the 
     default representation for strings coming from CLR methods and for Ruby 
     string literals.
   - A textual write operation on a mutable string that has this content 
     representation causes an implicit conversion of the representation to 
     StringBuilderContent.
   - A binary read/write operation triggers a transition to BinaryContent 
     using the Encoding stored on the owning MutableString.

2) StringBuilderContent
   - Holds an instance of System.Text.StringBuilder, a mutable Unicode string.
   - A binary read/write operation transforms the content to the BinaryContent 
     representation.
   - StringBuilder is not optimal for some operations (it requires unnecessary 
     copying); we may consider replacing it with a resizable char[].

3) BinaryContent
   - A textual read/write operation transforms the content to the 
     StringBuilderContent representation.
   - List<byte> is currently used, but it doesn't fit many operations very 
     well. We should replace it with a resizable byte[].

The content representation is changed based upon the operations that are 
performed on the mutable string. There is currently no limit on the number of 
content type switches, so if one alternates binary and textual operations, a 
conversion takes place for each one of them. Although this shouldn't be a 
common case, we may consider adding some counters and keeping the 
representation binary or textual based upon their values.
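The switching behaviour described above can be modelled with a small Ruby 
class. This is a loose sketch under assumptions, not the code in the shelveset: 
HybridString and all of its methods are invented names, a frozen Ruby String 
plays the role of StringContent, an unfrozen one plays StringBuilderContent, 
and an Array of integers plays BinaryContent. The switch counter is the kind of 
counter mentioned above, added here only so the conversions become visible.

```ruby
# Hypothetical model of a string that switches representation on demand.
class HybridString
  attr_reader :kind, :switches

  def initialize(text, encoding = Encoding::UTF_8)
    @encoding = encoding
    @kind = :string              # immutable text, like StringContent
    @content = text.dup.freeze
    @switches = 0
  end

  # Textual write: needs the mutable text representation.
  def append_text(text)
    to_builder
    @content << text
    self
  end

  # Binary write: needs the byte representation.
  def append_bytes(bytes)
    to_binary
    @content.concat(bytes)
    self
  end

  # Binary read: also forces the byte representation.
  def bytes
    to_binary
    @content.dup
  end

  # Textual read: only converts when the content is currently binary.
  def text
    to_builder if @kind == :binary
    @content.dup
  end

  private

  def to_builder
    case @kind
    when :string then @content = @content.dup   # unfrozen copy
    when :binary then @content = @content.pack("C*").force_encoding(@encoding)
    else return                                 # already mutable text
    end
    switch(:builder)
  end

  def to_binary
    return if @kind == :binary
    @content = @content.encode(@encoding).bytes # text -> bytes via the encoding
    switch(:binary)
  end

  def switch(kind)
    @kind = kind
    @switches += 1
  end
end

s = HybridString.new("abc")
s.append_text("d")       # :string  -> :builder
s.append_bytes([0x65])   # :builder -> :binary
puts s.text              # :binary  -> :builder
puts s.switches
```

Alternating append_text and append_bytes on the same instance increments the 
counter on every call, which is the uncommon but costly case the paragraph 
above has in mind.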

The design assumes that the operations implemented by library methods are of 
two kinds, textual and binary, and that data once treated as text are not 
usually treated as raw binary data later. Any text in the IronRuby runtime is 
represented as a sequence of 16-bit Unicode characters (the standard .NET 
representation). Any binary data treated as text is converted to this 
representation, regardless of the encoding used for the storage representation 
in the file. The encoding is remembered in the MutableString instance, so the 
original representation can always be recreated. Not all Unicode characters 
fit into 16 bits, so some exotic ones are represented by multiple characters 
(surrogates). If there is such a character in the string, some operations (e.g. 
indexing) might not be precise anymore: the n-th item in the char[] isn't 
necessarily the n-th Unicode character in the string (there might be escape 
characters). We believe this imprecision is not a real-world issue and is worth 
the performance gain and implementation simplicity.
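The indexing imprecision can be seen with a single character outside the 16-bit 
range. The sketch below is Ruby rather than .NET, so the array of UTF-16 code 
units only approximates what a char[] would hold; SAMPLE and utf16_units are 
names made up for the example.

```ruby
# "a", U+1D11E (musical G clef, outside the 16-bit range), "b".
SAMPLE = "a\u{1D11E}b"

# Returns the UTF-16 code units of `text`, roughly what a .NET char[] holds.
def utf16_units(text)
  text.encode(Encoding::UTF_16LE).unpack("v*")
end

units = utf16_units(SAMPLE)
puts SAMPLE.length                # number of Unicode characters
puts units.length                 # number of 16-bit code units
puts SAMPLE[2]                    # third character
puts format("0x%04X", units[2])   # third code unit: a low surrogate, not "b"
```

The string has three characters but four code units, so index 2 lands on the 
second half of the surrogate pair instead of on "b".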



Tomas

_______________________________________________
Ironruby-core mailing list
Ironruby-core@rubyforge.org
http://rubyforge.org/mailman/listinfo/ironruby-core
