AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

Jens Geyer Thu, 18 Oct 2012 00:02:16 -0700

It's not to sat that binary is not supported at all. It's about the 
inconsistencies.


________________________________
Von: Jens Geyer
Gesendet: 18.10.2012 08:59
An: [email protected]
Betreff: AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" 
fields are re-encoded

Hi Nathan,

I referred to the binary data type.

This one is in a semi-defined state, and IIRC there exist slightly different 
ways how binary fields are handled throughout the various languages. A ticket 
with some discussion already exists, can't remember the number right now.

The web site says ( or did say in the past) something to the effect that that 
binary is intended to become a fully supported real data type some day in the 
future.

Greetings,
Jens
________________________________
Von: Nathan Beyer
Gesendet: 18.10.2012 03:13
An: [email protected]
Betreff: Re: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" 
fields are re-encoded

Jens, Can you elaborate on the clean up that you'd like to see? I
missed your previous comments, but would like to read them if you can
put them up again or post some links.

On Fri, Oct 12, 2012 at 2:03 AM, Jens Geyer <[email protected]> wrote:
> Again I would propose to clean that up in general. Same confusion happens at 
> other places too.
>
> $0,02
> Jens
> ________________________________
> Von: XB (JIRA)
> Gesendet: 12.10.2012 00:15
> An: [email protected]
> Betreff: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" 
> fields are re-encoded
>
>
>     [ 
> https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474556#comment-13474556
>  ]
>
> XB commented on THRIFT-1727:
> ----------------------------
>
> The problem lies within the confusion between thrift "binary" fields and 
> thrift "string" fields. Typically, "string" fields (sequences of characters) 
> are reduced to "binary" fields (sequence of bytes). However, in the current 
> Thrift Ruby implementation, the converse is done: sequences of bytes are 
> reduced to sequences of characters (represented by the Ruby String type). The 
> problem is that the higher layers do the reduction from sequences of bytes to 
> sequences of characters, but the lower layers do not recognize sequences of 
> bytes at all anymore, and just know sequences of characters. Then, these 
> sequences of characters are in turn converted into sequences of bytes (for 
> the wire-format).
>
> This patch solves the problem:
>
> {noformat}
> diff --git a/lib/rb/ext/binary_protocol_accelerated.c 
> b/lib/rb/ext/binary_protocol_accelerated.c
> index a8ebe7f..b75acf5 100644
> --- a/lib/rb/ext/binary_protocol_accelerated.c
> +++ b/lib/rb/ext/binary_protocol_accelerated.c
> @@ -81,7 +81,7 @@ static void write_string_direct(VALUE trans, VALUE str) {
>    if (TYPE(str) != T_STRING) {
>      rb_raise(rb_eStandardError, "Value should be a string");
>    }
> -  str = convert_to_utf8_byte_buffer(str);
> +  str = convert_to_byte_buffer(str);
>    write_i32_direct(trans, RSTRING_LEN(str));
>    rb_funcall(trans, write_method_id, 1, str);
>  }
> diff --git a/lib/rb/ext/bytes.c b/lib/rb/ext/bytes.c
> index 8a6fac4..d767cf4 100644
> --- a/lib/rb/ext/bytes.c
> +++ b/lib/rb/ext/bytes.c
> @@ -27,8 +27,8 @@ VALUE force_binary_encoding(VALUE buffer) {
>    return rb_funcall(thrift_bytes_module, force_binary_encoding_id, 1, 
> buffer);
>  }
>
> -VALUE convert_to_utf8_byte_buffer(VALUE string) {
> -  return rb_funcall(thrift_bytes_module, convert_to_utf8_byte_buffer_id, 1, 
> string);
> +VALUE convert_to_byte_buffer(VALUE string) {
> +  return rb_funcall(thrift_bytes_module, convert_to_byte_buffer_id, 1, 
> string);
>  }
>
>  VALUE convert_to_string(VALUE utf8_buffer) {
> diff --git a/lib/rb/ext/bytes.h b/lib/rb/ext/bytes.h
> index 7108d83..a74b2c4 100644
> --- a/lib/rb/ext/bytes.h
> +++ b/lib/rb/ext/bytes.h
> @@ -27,5 +27,5 @@
>   */
>
>  VALUE force_binary_encoding(VALUE buffer);
> -VALUE convert_to_utf8_byte_buffer(VALUE string);
> +VALUE convert_to_byte_buffer(VALUE string);
>  VALUE convert_to_string(VALUE utf8_buffer);
> diff --git a/lib/rb/ext/compact_protocol.c b/lib/rb/ext/compact_protocol.c
> index 0c05481..513c379 100644
> --- a/lib/rb/ext/compact_protocol.c
> +++ b/lib/rb/ext/compact_protocol.c
> @@ -306,7 +306,7 @@ VALUE rb_thrift_compact_proto_write_double(VALUE self, 
> VALUE dub) {
>
>  VALUE rb_thrift_compact_proto_write_string(VALUE self, VALUE str) {
>    VALUE transport = GET_TRANSPORT(self);
> -  str = convert_to_utf8_byte_buffer(str);
> +  str = convert_to_byte_buffer(str);
>    write_varint32(transport, RSTRING_LEN(str));
>    WRITE(transport, RSTRING_PTR(str), RSTRING_LEN(str));
>    return Qnil;
> diff --git a/lib/rb/ext/constants.h b/lib/rb/ext/constants.h
> index 3bfac88..499734c 100644
> --- a/lib/rb/ext/constants.h
> +++ b/lib/rb/ext/constants.h
> @@ -77,7 +77,7 @@ extern ID read_all_method_id;
>  extern ID read_into_buffer_method_id;
>  extern ID native_qmark_method_id;
>  extern ID force_binary_encoding_id;
> -extern ID convert_to_utf8_byte_buffer_id;
> +extern ID convert_to_byte_buffer_id;
>  extern ID convert_to_string_id;
>
>  extern ID fields_const_id;
> diff --git a/lib/rb/ext/thrift_native.c b/lib/rb/ext/thrift_native.c
> index f066d6c..292ef17 100644
> --- a/lib/rb/ext/thrift_native.c
> +++ b/lib/rb/ext/thrift_native.c
> @@ -93,7 +93,7 @@ ID read_all_method_id;
>  ID read_into_buffer_method_id;
>  ID native_qmark_method_id;
>  ID force_binary_encoding_id;
> -ID convert_to_utf8_byte_buffer_id;
> +ID convert_to_byte_buffer_id;
>  ID convert_to_string_id;
>
>  // constant ids
> @@ -180,7 +180,7 @@ void Init_thrift_native() {
>    read_into_buffer_method_id = rb_intern("read_into_buffer");
>    native_qmark_method_id = rb_intern("native?");
>    force_binary_encoding_id = rb_intern("force_binary_encoding");
> -  convert_to_utf8_byte_buffer_id = rb_intern("convert_to_utf8_byte_buffer");
> +  convert_to_byte_buffer_id = rb_intern("convert_to_byte_buffer");
>    convert_to_string_id = rb_intern("convert_to_string");
>
>    // constant ids
> diff --git a/lib/rb/lib/thrift/bytes.rb b/lib/rb/lib/thrift/bytes.rb
> index efd4f64..b10d966 100644
> --- a/lib/rb/lib/thrift/bytes.rb
> +++ b/lib/rb/lib/thrift/bytes.rb
> @@ -69,21 +69,27 @@ module Thrift
>          string.setbyte(index, byte)
>        end
>
> -      # Converts the given String to a UTF-8 byte buffer.
> +      # Converts the given String to a UTF-8 byte buffer if the string is not
> +      # already binary. If it is already binary, do nothing.
>        #
>        # string - The String to convert.
>        #
> -      # Returns a new String with BINARY encoding, containing the UTF-8
> -      # bytes of the original string.
> -      def self.convert_to_utf8_byte_buffer(string)
> -        if string.encoding != Encoding::UTF_8
> -          # transcode to UTF-8
> -          string = string.encode(Encoding::UTF_8)
> +      # Returns a String with BINARY encoding (either a new string containing
> +      # the UTF-8 bytes of the original string or the old string if it was
> +      # already binary)
> +      def self.convert_to_byte_buffer(string)
> +        if string.encoding != Encoding::BINARY
> +          if string.encoding != Encoding::UTF_8
> +            # transcode to UTF-8
> +            string = string.encode(Encoding::UTF_8)
> +          else
> +            # encoding is already UTF-8, but a duplicate is needed
> +            string = string.dup
> +          end
> +          string.force_encoding(Encoding::BINARY)
>          else
> -          # encoding is already UTF-8, but a duplicate is needed
> -          string = string.dup
> +          string
>          end
> -        string.force_encoding(Encoding::BINARY)
>        end
>
>        # Converts the given UTF-8 byte buffer into a String
> @@ -116,14 +122,14 @@ module Thrift
>          string[index] = byte
>        end
>
> -      def self.convert_to_utf8_byte_buffer(string)
> +      def self.convert_to_byte_buffer(string)
>          # This assumes $KCODE is 'UTF8'/'U', which would mean the String is 
> already a UTF-8 byte buffer
>          # TODO consider handling other $KCODE values and transcoding with 
> iconv
>          string
>        end
>
>        def self.convert_to_string(utf8_buffer)
> -        # See comment in 'convert_to_utf8_byte_buffer' for relevant 
> assumptions.
> +        # See comment in 'convert_to_byte_buffer' for relevant assumptions.
>          utf8_buffer
>        end
>      end
> diff --git a/lib/rb/lib/thrift/protocol/binary_protocol.rb 
> b/lib/rb/lib/thrift/protocol/binary_protocol.rb
> index 2528276..d227bdd 100644
> --- a/lib/rb/lib/thrift/protocol/binary_protocol.rb
> +++ b/lib/rb/lib/thrift/protocol/binary_protocol.rb
> @@ -107,7 +107,7 @@ module Thrift
>      end
>
>      def write_string(str)
> -      str = Bytes.convert_to_utf8_byte_buffer(str)
> +      str = Bytes.convert_to_byte_buffer(str)
>        write_i32(str.length)
>        trans.write(str)
>      end
> diff --git a/lib/rb/lib/thrift/protocol/compact_protocol.rb 
> b/lib/rb/lib/thrift/protocol/compact_protocol.rb
> index 758e1ae..c813561 100644
> --- a/lib/rb/lib/thrift/protocol/compact_protocol.rb
> +++ b/lib/rb/lib/thrift/protocol/compact_protocol.rb
> @@ -210,7 +210,7 @@ module Thrift
>      end
>
>      def write_string(str)
> -      str = Bytes.convert_to_utf8_byte_buffer(str)
> +      str = Bytes.convert_to_byte_buffer(str)
>        write_varint32(str.length)
>        @trans.write(str)
>      end
> diff --git a/lib/rb/spec/bytes_spec.rb b/lib/rb/spec/bytes_spec.rb
> index b82e304..863c1bb 100644
> --- a/lib/rb/spec/bytes_spec.rb
> +++ b/lib/rb/spec/bytes_spec.rb
> @@ -64,12 +64,12 @@ describe Thrift::Bytes do
>        end
>      end
>
> -    describe '.convert_to_utf8_byte_buffer' do
> +    describe '.convert_to_byte_buffer' do
>        it 'should convert UTF-8 String to byte buffer' do
>          e = "\u20AC".encode('UTF-8') # a string with euro sign character 
> U+20AC
>          e.length.should == 1
>
> -        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
> +        a = Thrift::Bytes.convert_to_byte_buffer e
>          a.encoding.should == Encoding::BINARY
>          a.length.should == 3
>          a.unpack('C*').should == [0xE2, 0x82, 0xAC]
> @@ -81,7 +81,7 @@ describe Thrift::Bytes do
>          e.length.should == 1
>          e.unpack('C*').should == [0xA4] # euro sign is a different code 
> point in ISO-8859-15
>
> -        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
> +        a = Thrift::Bytes.convert_to_byte_buffer e
>          a.encoding.should == Encoding::BINARY
>          a.length.should == 3
>          a.unpack('C*').should == [0xE2, 0x82, 0xAC]
> @@ -139,10 +139,10 @@ describe Thrift::Bytes do
>        end
>      end
>
> -    describe '.convert_to_utf8_byte_buffer' do
> +    describe '.convert_to_byte_buffer' do
>        it 'should be a no-op' do
>          e = 'STRING'
> -        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
> +        a = Thrift::Bytes.convert_to_byte_buffer e
>          a.should == e
>          a.should be(e)
>        end
> {noformat}
>
> Basically, the method convert_to_utf8_byte_buffer() assumes that all Strings 
> it gets passed are just sequences of characters, which is wrong. Strings 
> which are sequences of bytes should have encoding Encoding::BINARY. If this 
> is the case, then the method convert_to_utf8_byte_buffer() should not perform 
> any conversion, because the string is already in the desired format. The 
> patch achieves that. However, the naming of the method is also not correct 
> anymore then, thus, it is renamed to make it clear that the result of this 
> method is not always a UTF-8 encoded string. (But it also does not need to be 
> one).
>
>> Ruby-1.9: data loss: "binary" fields are re-encoded
>> ---------------------------------------------------
>>
>>                 Key: THRIFT-1727
>>                 URL: https://issues.apache.org/jira/browse/THRIFT-1727
>>             Project: Thrift
>>          Issue Type: Bug
>>          Components: Ruby - Library
>>    Affects Versions: 0.9
>>         Environment: JRuby 1.6.8 using "--1.9" command line parameter.
>>            Reporter: XB
>>
>> When setting a binary field of a Thrift object with some binary data (e.g. a 
>> string whose encoding is "ASCII-8BIT") and then serializing this object, the 
>> binary data is re-encoded. That is, it is encoded as if it were not a 
>> sequence of bytes but a sequence of characters, encoded using the ISO-8859-1 
>> encoding. This assumed ISO-8859-1 sequence of characters is then converted 
>> into UTF-8 (by BinaryProtocol or CompactProtocol). This basically means that 
>> all bytes whose values are between 0x80 (inclusive) and 0x100 (exclusive) 
>> are converted into multi-byte sequences. This leads to data corruption.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira

AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

Reply via email to