It's not to sat that binary is not supported at all. It's about the inconsistencies.
________________________________ Von: Jens Geyer Gesendet: 18.10.2012 08:59 An: [email protected] Betreff: AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded Hi Nathan, I referred to the binary data type. This one is in a semi-defined state, and IIRC there exist slightly different ways how binary fields are handled throughout the various languages. A ticket with some discussion already exists, can't remember the number right now. The web site says ( or did say in the past) something to the effect that that binary is intended to become a fully supported real data type some day in the future. Greetings, Jens ________________________________ Von: Nathan Beyer Gesendet: 18.10.2012 03:13 An: [email protected] Betreff: Re: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded Jens, Can you elaborate on the clean up that you'd like to see? I missed your previous comments, but would like to read them if you can put them up again or post some links. On Fri, Oct 12, 2012 at 2:03 AM, Jens Geyer <[email protected]> wrote: > Again I would propose to clean that up in general. Same confusion happens at > other places too. > > $0,02 > Jens > ________________________________ > Von: XB (JIRA) > Gesendet: 12.10.2012 00:15 > An: [email protected] > Betreff: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" > fields are re-encoded > > > [ > https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474556#comment-13474556 > ] > > XB commented on THRIFT-1727: > ---------------------------- > > The problem lies within the confusion between thrift "binary" fields and > thrift "string" fields. Typically, "string" fields (sequences of characters) > are reduced to "binary" fields (sequence of bytes). However, in the current > Thrift Ruby implementation, the converse is done: sequences of bytes are > reduced to sequences of characters (represented by the Ruby String type). The > problem is that the higher layers do the reduction from sequences of bytes to > sequences of characters, but the lower layers do not recognize sequences of > bytes at all anymore, and just know sequences of characters. Then, these > sequences of characters are in turn converted into sequences of bytes (for > the wire-format). > > This patch solves the problem: > > {noformat} > diff --git a/lib/rb/ext/binary_protocol_accelerated.c > b/lib/rb/ext/binary_protocol_accelerated.c > index a8ebe7f..b75acf5 100644 > --- a/lib/rb/ext/binary_protocol_accelerated.c > +++ b/lib/rb/ext/binary_protocol_accelerated.c > @@ -81,7 +81,7 @@ static void write_string_direct(VALUE trans, VALUE str) { > if (TYPE(str) != T_STRING) { > rb_raise(rb_eStandardError, "Value should be a string"); > } > - str = convert_to_utf8_byte_buffer(str); > + str = convert_to_byte_buffer(str); > write_i32_direct(trans, RSTRING_LEN(str)); > rb_funcall(trans, write_method_id, 1, str); > } > diff --git a/lib/rb/ext/bytes.c b/lib/rb/ext/bytes.c > index 8a6fac4..d767cf4 100644 > --- a/lib/rb/ext/bytes.c > +++ b/lib/rb/ext/bytes.c > @@ -27,8 +27,8 @@ VALUE force_binary_encoding(VALUE buffer) { > return rb_funcall(thrift_bytes_module, force_binary_encoding_id, 1, > buffer); > } > > -VALUE convert_to_utf8_byte_buffer(VALUE string) { > - return rb_funcall(thrift_bytes_module, convert_to_utf8_byte_buffer_id, 1, > string); > +VALUE convert_to_byte_buffer(VALUE string) { > + return rb_funcall(thrift_bytes_module, convert_to_byte_buffer_id, 1, > string); > } > > VALUE convert_to_string(VALUE utf8_buffer) { > diff --git a/lib/rb/ext/bytes.h b/lib/rb/ext/bytes.h > index 7108d83..a74b2c4 100644 > --- a/lib/rb/ext/bytes.h > +++ b/lib/rb/ext/bytes.h > @@ -27,5 +27,5 @@ > */ > > VALUE force_binary_encoding(VALUE buffer); > -VALUE convert_to_utf8_byte_buffer(VALUE string); > +VALUE convert_to_byte_buffer(VALUE string); > VALUE convert_to_string(VALUE utf8_buffer); > diff --git a/lib/rb/ext/compact_protocol.c b/lib/rb/ext/compact_protocol.c > index 0c05481..513c379 100644 > --- a/lib/rb/ext/compact_protocol.c > +++ b/lib/rb/ext/compact_protocol.c > @@ -306,7 +306,7 @@ VALUE rb_thrift_compact_proto_write_double(VALUE self, > VALUE dub) { > > VALUE rb_thrift_compact_proto_write_string(VALUE self, VALUE str) { > VALUE transport = GET_TRANSPORT(self); > - str = convert_to_utf8_byte_buffer(str); > + str = convert_to_byte_buffer(str); > write_varint32(transport, RSTRING_LEN(str)); > WRITE(transport, RSTRING_PTR(str), RSTRING_LEN(str)); > return Qnil; > diff --git a/lib/rb/ext/constants.h b/lib/rb/ext/constants.h > index 3bfac88..499734c 100644 > --- a/lib/rb/ext/constants.h > +++ b/lib/rb/ext/constants.h > @@ -77,7 +77,7 @@ extern ID read_all_method_id; > extern ID read_into_buffer_method_id; > extern ID native_qmark_method_id; > extern ID force_binary_encoding_id; > -extern ID convert_to_utf8_byte_buffer_id; > +extern ID convert_to_byte_buffer_id; > extern ID convert_to_string_id; > > extern ID fields_const_id; > diff --git a/lib/rb/ext/thrift_native.c b/lib/rb/ext/thrift_native.c > index f066d6c..292ef17 100644 > --- a/lib/rb/ext/thrift_native.c > +++ b/lib/rb/ext/thrift_native.c > @@ -93,7 +93,7 @@ ID read_all_method_id; > ID read_into_buffer_method_id; > ID native_qmark_method_id; > ID force_binary_encoding_id; > -ID convert_to_utf8_byte_buffer_id; > +ID convert_to_byte_buffer_id; > ID convert_to_string_id; > > // constant ids > @@ -180,7 +180,7 @@ void Init_thrift_native() { > read_into_buffer_method_id = rb_intern("read_into_buffer"); > native_qmark_method_id = rb_intern("native?"); > force_binary_encoding_id = rb_intern("force_binary_encoding"); > - convert_to_utf8_byte_buffer_id = rb_intern("convert_to_utf8_byte_buffer"); > + convert_to_byte_buffer_id = rb_intern("convert_to_byte_buffer"); > convert_to_string_id = rb_intern("convert_to_string"); > > // constant ids > diff --git a/lib/rb/lib/thrift/bytes.rb b/lib/rb/lib/thrift/bytes.rb > index efd4f64..b10d966 100644 > --- a/lib/rb/lib/thrift/bytes.rb > +++ b/lib/rb/lib/thrift/bytes.rb > @@ -69,21 +69,27 @@ module Thrift > string.setbyte(index, byte) > end > > - # Converts the given String to a UTF-8 byte buffer. > + # Converts the given String to a UTF-8 byte buffer if the string is not > + # already binary. If it is already binary, do nothing. > # > # string - The String to convert. > # > - # Returns a new String with BINARY encoding, containing the UTF-8 > - # bytes of the original string. > - def self.convert_to_utf8_byte_buffer(string) > - if string.encoding != Encoding::UTF_8 > - # transcode to UTF-8 > - string = string.encode(Encoding::UTF_8) > + # Returns a String with BINARY encoding (either a new string containing > + # the UTF-8 bytes of the original string or the old string if it was > + # already binary) > + def self.convert_to_byte_buffer(string) > + if string.encoding != Encoding::BINARY > + if string.encoding != Encoding::UTF_8 > + # transcode to UTF-8 > + string = string.encode(Encoding::UTF_8) > + else > + # encoding is already UTF-8, but a duplicate is needed > + string = string.dup > + end > + string.force_encoding(Encoding::BINARY) > else > - # encoding is already UTF-8, but a duplicate is needed > - string = string.dup > + string > end > - string.force_encoding(Encoding::BINARY) > end > > # Converts the given UTF-8 byte buffer into a String > @@ -116,14 +122,14 @@ module Thrift > string[index] = byte > end > > - def self.convert_to_utf8_byte_buffer(string) > + def self.convert_to_byte_buffer(string) > # This assumes $KCODE is 'UTF8'/'U', which would mean the String is > already a UTF-8 byte buffer > # TODO consider handling other $KCODE values and transcoding with > iconv > string > end > > def self.convert_to_string(utf8_buffer) > - # See comment in 'convert_to_utf8_byte_buffer' for relevant > assumptions. > + # See comment in 'convert_to_byte_buffer' for relevant assumptions. > utf8_buffer > end > end > diff --git a/lib/rb/lib/thrift/protocol/binary_protocol.rb > b/lib/rb/lib/thrift/protocol/binary_protocol.rb > index 2528276..d227bdd 100644 > --- a/lib/rb/lib/thrift/protocol/binary_protocol.rb > +++ b/lib/rb/lib/thrift/protocol/binary_protocol.rb > @@ -107,7 +107,7 @@ module Thrift > end > > def write_string(str) > - str = Bytes.convert_to_utf8_byte_buffer(str) > + str = Bytes.convert_to_byte_buffer(str) > write_i32(str.length) > trans.write(str) > end > diff --git a/lib/rb/lib/thrift/protocol/compact_protocol.rb > b/lib/rb/lib/thrift/protocol/compact_protocol.rb > index 758e1ae..c813561 100644 > --- a/lib/rb/lib/thrift/protocol/compact_protocol.rb > +++ b/lib/rb/lib/thrift/protocol/compact_protocol.rb > @@ -210,7 +210,7 @@ module Thrift > end > > def write_string(str) > - str = Bytes.convert_to_utf8_byte_buffer(str) > + str = Bytes.convert_to_byte_buffer(str) > write_varint32(str.length) > @trans.write(str) > end > diff --git a/lib/rb/spec/bytes_spec.rb b/lib/rb/spec/bytes_spec.rb > index b82e304..863c1bb 100644 > --- a/lib/rb/spec/bytes_spec.rb > +++ b/lib/rb/spec/bytes_spec.rb > @@ -64,12 +64,12 @@ describe Thrift::Bytes do > end > end > > - describe '.convert_to_utf8_byte_buffer' do > + describe '.convert_to_byte_buffer' do > it 'should convert UTF-8 String to byte buffer' do > e = "\u20AC".encode('UTF-8') # a string with euro sign character > U+20AC > e.length.should == 1 > > - a = Thrift::Bytes.convert_to_utf8_byte_buffer e > + a = Thrift::Bytes.convert_to_byte_buffer e > a.encoding.should == Encoding::BINARY > a.length.should == 3 > a.unpack('C*').should == [0xE2, 0x82, 0xAC] > @@ -81,7 +81,7 @@ describe Thrift::Bytes do > e.length.should == 1 > e.unpack('C*').should == [0xA4] # euro sign is a different code > point in ISO-8859-15 > > - a = Thrift::Bytes.convert_to_utf8_byte_buffer e > + a = Thrift::Bytes.convert_to_byte_buffer e > a.encoding.should == Encoding::BINARY > a.length.should == 3 > a.unpack('C*').should == [0xE2, 0x82, 0xAC] > @@ -139,10 +139,10 @@ describe Thrift::Bytes do > end > end > > - describe '.convert_to_utf8_byte_buffer' do > + describe '.convert_to_byte_buffer' do > it 'should be a no-op' do > e = 'STRING' > - a = Thrift::Bytes.convert_to_utf8_byte_buffer e > + a = Thrift::Bytes.convert_to_byte_buffer e > a.should == e > a.should be(e) > end > {noformat} > > Basically, the method convert_to_utf8_byte_buffer() assumes that all Strings > it gets passed are just sequences of characters, which is wrong. Strings > which are sequences of bytes should have encoding Encoding::BINARY. If this > is the case, then the method convert_to_utf8_byte_buffer() should not perform > any conversion, because the string is already in the desired format. The > patch achieves that. However, the naming of the method is also not correct > anymore then, thus, it is renamed to make it clear that the result of this > method is not always a UTF-8 encoded string. (But it also does not need to be > one). > >> Ruby-1.9: data loss: "binary" fields are re-encoded >> --------------------------------------------------- >> >> Key: THRIFT-1727 >> URL: https://issues.apache.org/jira/browse/THRIFT-1727 >> Project: Thrift >> Issue Type: Bug >> Components: Ruby - Library >> Affects Versions: 0.9 >> Environment: JRuby 1.6.8 using "--1.9" command line parameter. >> Reporter: XB >> >> When setting a binary field of a Thrift object with some binary data (e.g. a >> string whose encoding is "ASCII-8BIT") and then serializing this object, the >> binary data is re-encoded. That is, it is encoded as if it were not a >> sequence of bytes but a sequence of characters, encoded using the ISO-8859-1 >> encoding. This assumed ISO-8859-1 sequence of characters is then converted >> into UTF-8 (by BinaryProtocol or CompactProtocol). This basically means that >> all bytes whose values are between 0x80 (inclusive) and 0x100 (exclusive) >> are converted into multi-byte sequences. This leads to data corruption. > > -- > This message is automatically generated by JIRA. > If you think it was sent incorrectly, please contact your JIRA administrators > For more information on JIRA, see: http://www.atlassian.com/software/jira
