Re: AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

Nathan Beyer Fri, 19 Oct 2012 19:46:45 -0700

Oh, I thought you were talking about it in reference to the Ruby lib only.

On Oct 18, 2012, at 1:59 AM, Jens Geyer <[email protected]> wrote:


> It's not to sat that binary is not supported at all. It's about the 
> inconsistencies.
> 
> ________________________________
> Von: Jens Geyer
> Gesendet: 18.10.2012 08:59
> An: [email protected]
> Betreff: AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" 
> fields are re-encoded
> 
> Hi Nathan,
> 
> I referred to the binary data type.
> 
> This one is in a semi-defined state, and IIRC there exist slightly different 
> ways how binary fields are handled throughout the various languages. A ticket 
> with some discussion already exists, can't remember the number right now.
> 
> The web site says ( or did say in the past) something to the effect that that 
> binary is intended to become a fully supported real data type some day in the 
> future.
> 
> Greetings,
> Jens
> ________________________________
> Von: Nathan Beyer
> Gesendet: 18.10.2012 03:13
> An: [email protected]
> Betreff: Re: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" 
> fields are re-encoded
> 
> Jens, Can you elaborate on the clean up that you'd like to see? I
> missed your previous comments, but would like to read them if you can
> put them up again or post some links.
> 
> On Fri, Oct 12, 2012 at 2:03 AM, Jens Geyer <[email protected]> wrote:
>> Again I would propose to clean that up in general. Same confusion happens at 
>> other places too.
>> 
>> $0,02
>> Jens
>> ________________________________
>> Von: XB (JIRA)
>> Gesendet: 12.10.2012 00:15
>> An: [email protected]
>> Betreff: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" 
>> fields are re-encoded
>> 
>> 
>>    [ 
>> https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474556#comment-13474556
>>  ]
>> 
>> XB commented on THRIFT-1727:
>> ----------------------------
>> 
>> The problem lies within the confusion between thrift "binary" fields and 
>> thrift "string" fields. Typically, "string" fields (sequences of characters) 
>> are reduced to "binary" fields (sequence of bytes). However, in the current 
>> Thrift Ruby implementation, the converse is done: sequences of bytes are 
>> reduced to sequences of characters (represented by the Ruby String type). 
>> The problem is that the higher layers do the reduction from sequences of 
>> bytes to sequences of characters, but the lower layers do not recognize 
>> sequences of bytes at all anymore, and just know sequences of characters. 
>> Then, these sequences of characters are in turn converted into sequences of 
>> bytes (for the wire-format).
>> 
>> This patch solves the problem:
>> 
>> {noformat}
>> diff --git a/lib/rb/ext/binary_protocol_accelerated.c 
>> b/lib/rb/ext/binary_protocol_accelerated.c
>> index a8ebe7f..b75acf5 100644
>> --- a/lib/rb/ext/binary_protocol_accelerated.c
>> +++ b/lib/rb/ext/binary_protocol_accelerated.c
>> @@ -81,7 +81,7 @@ static void write_string_direct(VALUE trans, VALUE str) {
>>   if (TYPE(str) != T_STRING) {
>>     rb_raise(rb_eStandardError, "Value should be a string");
>>   }
>> -  str = convert_to_utf8_byte_buffer(str);
>> +  str = convert_to_byte_buffer(str);
>>   write_i32_direct(trans, RSTRING_LEN(str));
>>   rb_funcall(trans, write_method_id, 1, str);
>> }
>> diff --git a/lib/rb/ext/bytes.c b/lib/rb/ext/bytes.c
>> index 8a6fac4..d767cf4 100644
>> --- a/lib/rb/ext/bytes.c
>> +++ b/lib/rb/ext/bytes.c
>> @@ -27,8 +27,8 @@ VALUE force_binary_encoding(VALUE buffer) {
>>   return rb_funcall(thrift_bytes_module, force_binary_encoding_id, 1, 
>> buffer);
>> }
>> 
>> -VALUE convert_to_utf8_byte_buffer(VALUE string) {
>> -  return rb_funcall(thrift_bytes_module, convert_to_utf8_byte_buffer_id, 1, 
>> string);
>> +VALUE convert_to_byte_buffer(VALUE string) {
>> +  return rb_funcall(thrift_bytes_module, convert_to_byte_buffer_id, 1, 
>> string);
>> }
>> 
>> VALUE convert_to_string(VALUE utf8_buffer) {
>> diff --git a/lib/rb/ext/bytes.h b/lib/rb/ext/bytes.h
>> index 7108d83..a74b2c4 100644
>> --- a/lib/rb/ext/bytes.h
>> +++ b/lib/rb/ext/bytes.h
>> @@ -27,5 +27,5 @@
>>  */
>> 
>> VALUE force_binary_encoding(VALUE buffer);
>> -VALUE convert_to_utf8_byte_buffer(VALUE string);
>> +VALUE convert_to_byte_buffer(VALUE string);
>> VALUE convert_to_string(VALUE utf8_buffer);
>> diff --git a/lib/rb/ext/compact_protocol.c b/lib/rb/ext/compact_protocol.c
>> index 0c05481..513c379 100644
>> --- a/lib/rb/ext/compact_protocol.c
>> +++ b/lib/rb/ext/compact_protocol.c
>> @@ -306,7 +306,7 @@ VALUE rb_thrift_compact_proto_write_double(VALUE self, 
>> VALUE dub) {
>> 
>> VALUE rb_thrift_compact_proto_write_string(VALUE self, VALUE str) {
>>   VALUE transport = GET_TRANSPORT(self);
>> -  str = convert_to_utf8_byte_buffer(str);
>> +  str = convert_to_byte_buffer(str);
>>   write_varint32(transport, RSTRING_LEN(str));
>>   WRITE(transport, RSTRING_PTR(str), RSTRING_LEN(str));
>>   return Qnil;
>> diff --git a/lib/rb/ext/constants.h b/lib/rb/ext/constants.h
>> index 3bfac88..499734c 100644
>> --- a/lib/rb/ext/constants.h
>> +++ b/lib/rb/ext/constants.h
>> @@ -77,7 +77,7 @@ extern ID read_all_method_id;
>> extern ID read_into_buffer_method_id;
>> extern ID native_qmark_method_id;
>> extern ID force_binary_encoding_id;
>> -extern ID convert_to_utf8_byte_buffer_id;
>> +extern ID convert_to_byte_buffer_id;
>> extern ID convert_to_string_id;
>> 
>> extern ID fields_const_id;
>> diff --git a/lib/rb/ext/thrift_native.c b/lib/rb/ext/thrift_native.c
>> index f066d6c..292ef17 100644
>> --- a/lib/rb/ext/thrift_native.c
>> +++ b/lib/rb/ext/thrift_native.c
>> @@ -93,7 +93,7 @@ ID read_all_method_id;
>> ID read_into_buffer_method_id;
>> ID native_qmark_method_id;
>> ID force_binary_encoding_id;
>> -ID convert_to_utf8_byte_buffer_id;
>> +ID convert_to_byte_buffer_id;
>> ID convert_to_string_id;
>> 
>> // constant ids
>> @@ -180,7 +180,7 @@ void Init_thrift_native() {
>>   read_into_buffer_method_id = rb_intern("read_into_buffer");
>>   native_qmark_method_id = rb_intern("native?");
>>   force_binary_encoding_id = rb_intern("force_binary_encoding");
>> -  convert_to_utf8_byte_buffer_id = rb_intern("convert_to_utf8_byte_buffer");
>> +  convert_to_byte_buffer_id = rb_intern("convert_to_byte_buffer");
>>   convert_to_string_id = rb_intern("convert_to_string");
>> 
>>   // constant ids
>> diff --git a/lib/rb/lib/thrift/bytes.rb b/lib/rb/lib/thrift/bytes.rb
>> index efd4f64..b10d966 100644
>> --- a/lib/rb/lib/thrift/bytes.rb
>> +++ b/lib/rb/lib/thrift/bytes.rb
>> @@ -69,21 +69,27 @@ module Thrift
>>         string.setbyte(index, byte)
>>       end
>> 
>> -      # Converts the given String to a UTF-8 byte buffer.
>> +      # Converts the given String to a UTF-8 byte buffer if the string is 
>> not
>> +      # already binary. If it is already binary, do nothing.
>>       #
>>       # string - The String to convert.
>>       #
>> -      # Returns a new String with BINARY encoding, containing the UTF-8
>> -      # bytes of the original string.
>> -      def self.convert_to_utf8_byte_buffer(string)
>> -        if string.encoding != Encoding::UTF_8
>> -          # transcode to UTF-8
>> -          string = string.encode(Encoding::UTF_8)
>> +      # Returns a String with BINARY encoding (either a new string 
>> containing
>> +      # the UTF-8 bytes of the original string or the old string if it was
>> +      # already binary)
>> +      def self.convert_to_byte_buffer(string)
>> +        if string.encoding != Encoding::BINARY
>> +          if string.encoding != Encoding::UTF_8
>> +            # transcode to UTF-8
>> +            string = string.encode(Encoding::UTF_8)
>> +          else
>> +            # encoding is already UTF-8, but a duplicate is needed
>> +            string = string.dup
>> +          end
>> +          string.force_encoding(Encoding::BINARY)
>>         else
>> -          # encoding is already UTF-8, but a duplicate is needed
>> -          string = string.dup
>> +          string
>>         end
>> -        string.force_encoding(Encoding::BINARY)
>>       end
>> 
>>       # Converts the given UTF-8 byte buffer into a String
>> @@ -116,14 +122,14 @@ module Thrift
>>         string[index] = byte
>>       end
>> 
>> -      def self.convert_to_utf8_byte_buffer(string)
>> +      def self.convert_to_byte_buffer(string)
>>         # This assumes $KCODE is 'UTF8'/'U', which would mean the String is 
>> already a UTF-8 byte buffer
>>         # TODO consider handling other $KCODE values and transcoding with 
>> iconv
>>         string
>>       end
>> 
>>       def self.convert_to_string(utf8_buffer)
>> -        # See comment in 'convert_to_utf8_byte_buffer' for relevant 
>> assumptions.
>> +        # See comment in 'convert_to_byte_buffer' for relevant assumptions.
>>         utf8_buffer
>>       end
>>     end
>> diff --git a/lib/rb/lib/thrift/protocol/binary_protocol.rb 
>> b/lib/rb/lib/thrift/protocol/binary_protocol.rb
>> index 2528276..d227bdd 100644
>> --- a/lib/rb/lib/thrift/protocol/binary_protocol.rb
>> +++ b/lib/rb/lib/thrift/protocol/binary_protocol.rb
>> @@ -107,7 +107,7 @@ module Thrift
>>     end
>> 
>>     def write_string(str)
>> -      str = Bytes.convert_to_utf8_byte_buffer(str)
>> +      str = Bytes.convert_to_byte_buffer(str)
>>       write_i32(str.length)
>>       trans.write(str)
>>     end
>> diff --git a/lib/rb/lib/thrift/protocol/compact_protocol.rb 
>> b/lib/rb/lib/thrift/protocol/compact_protocol.rb
>> index 758e1ae..c813561 100644
>> --- a/lib/rb/lib/thrift/protocol/compact_protocol.rb
>> +++ b/lib/rb/lib/thrift/protocol/compact_protocol.rb
>> @@ -210,7 +210,7 @@ module Thrift
>>     end
>> 
>>     def write_string(str)
>> -      str = Bytes.convert_to_utf8_byte_buffer(str)
>> +      str = Bytes.convert_to_byte_buffer(str)
>>       write_varint32(str.length)
>>       @trans.write(str)
>>     end
>> diff --git a/lib/rb/spec/bytes_spec.rb b/lib/rb/spec/bytes_spec.rb
>> index b82e304..863c1bb 100644
>> --- a/lib/rb/spec/bytes_spec.rb
>> +++ b/lib/rb/spec/bytes_spec.rb
>> @@ -64,12 +64,12 @@ describe Thrift::Bytes do
>>       end
>>     end
>> 
>> -    describe '.convert_to_utf8_byte_buffer' do
>> +    describe '.convert_to_byte_buffer' do
>>       it 'should convert UTF-8 String to byte buffer' do
>>         e = "\u20AC".encode('UTF-8') # a string with euro sign character 
>> U+20AC
>>         e.length.should == 1
>> 
>> -        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
>> +        a = Thrift::Bytes.convert_to_byte_buffer e
>>         a.encoding.should == Encoding::BINARY
>>         a.length.should == 3
>>         a.unpack('C*').should == [0xE2, 0x82, 0xAC]
>> @@ -81,7 +81,7 @@ describe Thrift::Bytes do
>>         e.length.should == 1
>>         e.unpack('C*').should == [0xA4] # euro sign is a different code 
>> point in ISO-8859-15
>> 
>> -        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
>> +        a = Thrift::Bytes.convert_to_byte_buffer e
>>         a.encoding.should == Encoding::BINARY
>>         a.length.should == 3
>>         a.unpack('C*').should == [0xE2, 0x82, 0xAC]
>> @@ -139,10 +139,10 @@ describe Thrift::Bytes do
>>       end
>>     end
>> 
>> -    describe '.convert_to_utf8_byte_buffer' do
>> +    describe '.convert_to_byte_buffer' do
>>       it 'should be a no-op' do
>>         e = 'STRING'
>> -        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
>> +        a = Thrift::Bytes.convert_to_byte_buffer e
>>         a.should == e
>>         a.should be(e)
>>       end
>> {noformat}
>> 
>> Basically, the method convert_to_utf8_byte_buffer() assumes that all Strings 
>> it gets passed are just sequences of characters, which is wrong. Strings 
>> which are sequences of bytes should have encoding Encoding::BINARY. If this 
>> is the case, then the method convert_to_utf8_byte_buffer() should not 
>> perform any conversion, because the string is already in the desired format. 
>> The patch achieves that. However, the naming of the method is also not 
>> correct anymore then, thus, it is renamed to make it clear that the result 
>> of this method is not always a UTF-8 encoded string. (But it also does not 
>> need to be one).
>> 
>>> Ruby-1.9: data loss: "binary" fields are re-encoded
>>> ---------------------------------------------------
>>> 
>>>                Key: THRIFT-1727
>>>                URL: https://issues.apache.org/jira/browse/THRIFT-1727
>>>            Project: Thrift
>>>         Issue Type: Bug
>>>         Components: Ruby - Library
>>>   Affects Versions: 0.9
>>>        Environment: JRuby 1.6.8 using "--1.9" command line parameter.
>>>           Reporter: XB
>>> 
>>> When setting a binary field of a Thrift object with some binary data (e.g. 
>>> a string whose encoding is "ASCII-8BIT") and then serializing this object, 
>>> the binary data is re-encoded. That is, it is encoded as if it were not a 
>>> sequence of bytes but a sequence of characters, encoded using the 
>>> ISO-8859-1 encoding. This assumed ISO-8859-1 sequence of characters is then 
>>> converted into UTF-8 (by BinaryProtocol or CompactProtocol). This basically 
>>> means that all bytes whose values are between 0x80 (inclusive) and 0x100 
>>> (exclusive) are converted into multi-byte sequences. This leads to data 
>>> corruption.
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA administrators
>> For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

Reply via email to