Tomas Matousek wrote:
The content representation is changed based on the operations performed on the 
mutable string. There is currently no limit on the number of content-type 
switches, so if one alternates binary and textual operations the conversion 
will take place for each of them. Although this shouldn't be a common case, 
we may consider adding counters and keeping the representation binary/textual 
based on their values.

Ok, so what constitutes a binary operation and what constitutes a textual operation? It seems like the potential for ping-ponging between the two representations would be a serious risk. That's largely why we ended up going with a single representation: so many APIs pass Strings around, manipulate them, index specific characters, write them through some stream to somewhere else, and repeat.

Of course, if the ping-pong isn't bad there could probably be some formalized list of rules. Such a set of "binary" operations and "textual" operations would be useful to JRuby and MacRuby, in addition to IronRuby.
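For concreteness, here's the kind of alternation I mean. The textual/binary classification in the comments is my guess at how such rules might shake out, not anything IronRuby actually specifies; in a dual-representation design each switch below would force a conversion:

```ruby
# Hypothetical classification: /n regexp substitution and unpack are
# "binary" operations; upcase is "textual". A string bouncing between
# them would convert its backing representation at every switch.
s = "abc\xFF".b                     # starts life as raw bytes
s = s.gsub(/\xFF/n, "")             # binary: byte-level regexp
t = s.dup.force_encoding("UTF-8")   # reinterpret the bytes as text
t = t.upcase                        # textual: character-level operation
bytes = t.unpack("C*")              # binary again: raw byte access
# => [65, 66, 67]
```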

Here's an example we ran into, however: regexp matching against binary content. I know of at least one library that uses regexps to parse out a binary file header. How would this work under IronRuby? There's also the concern about conversion from binary to text at inopportune moments, which could, for example, corrupt binary content that can't be decoded into valid UTF-16 characters. In our case, long ago, we represented all such binary content as "plain-encoded" UTF-16 with only the low byte set, but that obviously wasn't a whole lot better than just using bytes, and it was also much slower.
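To make the use case concrete, here's roughly what that kind of header parsing looks like in Ruby (the header format is made up for illustration):

```ruby
# Made-up binary header: 2-byte magic, 1-byte version, NUL-terminated name.
header = "\xAB\xCD\x02disk0\x00".b + "payload"
m = /\A\xAB\xCD(.)([^\x00]+)\x00/n.match(header)
version = m[1].unpack("C").first  # => 2
name    = m[2]                    # => "disk0"
```

Note the /n flag forcing a byte-oriented (ASCII-8BIT) regexp; matching this against a UTF-16 char[] representation is exactly the awkward case.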

I imagine this would also impact copy-on-write capabilities, yes? There would be operations that could completely change the backing store of a string.

The design assumes that the operations implemented by library methods are of 
two kinds, textual and binary, and that data once treated as text is not 
usually treated as raw binary data later. Any text in the IronRuby runtime is 
represented as a sequence of 16-bit Unicode characters (the standard .NET 
representation). Binary data treated as text is converted to this 
representation, regardless of the encoding used for storage representation in 
the file. The encoding is remembered in the MutableString instance, so the 
original representation can always be recreated. Not all Unicode characters 
fit into 16 bits, so some exotic ones are represented by multiple characters 
(surrogate pairs). If there is such a character in the string, some 
operations (e.g. indexing) might no longer be precise - the n-th item in the 
char[] isn't the n-th Unicode character in the string. We believe this
impreciseness is not a real-world issue and is worth the performance gain and implementation simplicity.
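The surrogate issue is easy to demonstrate from Ruby itself by looking at the UTF-16 code units a .NET char[] would hold (U+1D11E, MUSICAL SYMBOL G CLEF, lies outside the BMP):

```ruby
# One Unicode character, but two UTF-16 code units (a surrogate pair) —
# so char[]-based indexing sees two "characters" where Ruby sees one.
clef = "\u{1D11E}"
clef.length                                  # => 1 codepoint
units = clef.encode("UTF-16BE").unpack("n*")
units.map { |u| format("%04X", u) }          # => ["D834", "DD1E"]
units.length                                 # => 2
```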

I guess one obvious question here would be supporting multiple encodings, as in Ruby 1.9. With a byte[]-based string and JOni (Oniguruma port) it shouldn't be too difficult to add 1.9 string logic into JRuby. But it seems like it would be harder if we put in place the same rules you have for converting text into the platform's preferred format under certain circumstances.
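For reference, the 1.9 model keeps the bytes as-is and just tags them with an encoding, which is what makes the byte[]-plus-JOni approach a natural fit (shown here with modern Ruby, which inherited that model):

```ruby
# Ruby 1.9+ strings: raw bytes plus an encoding tag — no normalization
# to a preferred internal representation.
s = "caf\xC3\xA9".dup.force_encoding("UTF-8")
s.length    # => 4 characters ("café")
s.bytesize  # => 5 bytes
b = s.dup.force_encoding("ASCII-8BIT")
b.length    # => 5 — same bytes, reinterpreted one per "character"
```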

- Charlie
_______________________________________________
Ironruby-core mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ironruby-core