AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

Jens Geyer Fri, 12 Oct 2012 00:05:28 -0700

Again I would propose to clean that up in general. Same confusion happens at 
other places too.


$0,02
Jens
________________________________
Von: XB (JIRA)
Gesendet: 12.10.2012 00:15
An: [email protected]
Betreff: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields 
are re-encoded


    [ 
https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474556#comment-13474556
 ]

XB commented on THRIFT-1727:
----------------------------

The problem lies within the confusion between thrift "binary" fields and thrift 
"string" fields. Typically, "string" fields (sequences of characters) are 
reduced to "binary" fields (sequence of bytes). However, in the current Thrift 
Ruby implementation, the converse is done: sequences of bytes are reduced to 
sequences of characters (represented by the Ruby String type). The problem is 
that the higher layers do the reduction from sequences of bytes to sequences of 
characters, but the lower layers do not recognize sequences of bytes at all 
anymore, and just know sequences of characters. Then, these sequences of 
characters are in turn converted into sequences of bytes (for the wire-format).

This patch solves the problem:

{noformat}
diff --git a/lib/rb/ext/binary_protocol_accelerated.c 
b/lib/rb/ext/binary_protocol_accelerated.c
index a8ebe7f..b75acf5 100644
--- a/lib/rb/ext/binary_protocol_accelerated.c
+++ b/lib/rb/ext/binary_protocol_accelerated.c
@@ -81,7 +81,7 @@ static void write_string_direct(VALUE trans, VALUE str) {
   if (TYPE(str) != T_STRING) {
     rb_raise(rb_eStandardError, "Value should be a string");
   }
-  str = convert_to_utf8_byte_buffer(str);
+  str = convert_to_byte_buffer(str);
   write_i32_direct(trans, RSTRING_LEN(str));
   rb_funcall(trans, write_method_id, 1, str);
 }
diff --git a/lib/rb/ext/bytes.c b/lib/rb/ext/bytes.c
index 8a6fac4..d767cf4 100644
--- a/lib/rb/ext/bytes.c
+++ b/lib/rb/ext/bytes.c
@@ -27,8 +27,8 @@ VALUE force_binary_encoding(VALUE buffer) {
   return rb_funcall(thrift_bytes_module, force_binary_encoding_id, 1, buffer);
 }

-VALUE convert_to_utf8_byte_buffer(VALUE string) {
-  return rb_funcall(thrift_bytes_module, convert_to_utf8_byte_buffer_id, 1, 
string);
+VALUE convert_to_byte_buffer(VALUE string) {
+  return rb_funcall(thrift_bytes_module, convert_to_byte_buffer_id, 1, string);
 }

 VALUE convert_to_string(VALUE utf8_buffer) {
diff --git a/lib/rb/ext/bytes.h b/lib/rb/ext/bytes.h
index 7108d83..a74b2c4 100644
--- a/lib/rb/ext/bytes.h
+++ b/lib/rb/ext/bytes.h
@@ -27,5 +27,5 @@
  */

 VALUE force_binary_encoding(VALUE buffer);
-VALUE convert_to_utf8_byte_buffer(VALUE string);
+VALUE convert_to_byte_buffer(VALUE string);
 VALUE convert_to_string(VALUE utf8_buffer);
diff --git a/lib/rb/ext/compact_protocol.c b/lib/rb/ext/compact_protocol.c
index 0c05481..513c379 100644
--- a/lib/rb/ext/compact_protocol.c
+++ b/lib/rb/ext/compact_protocol.c
@@ -306,7 +306,7 @@ VALUE rb_thrift_compact_proto_write_double(VALUE self, 
VALUE dub) {

 VALUE rb_thrift_compact_proto_write_string(VALUE self, VALUE str) {
   VALUE transport = GET_TRANSPORT(self);
-  str = convert_to_utf8_byte_buffer(str);
+  str = convert_to_byte_buffer(str);
   write_varint32(transport, RSTRING_LEN(str));
   WRITE(transport, RSTRING_PTR(str), RSTRING_LEN(str));
   return Qnil;
diff --git a/lib/rb/ext/constants.h b/lib/rb/ext/constants.h
index 3bfac88..499734c 100644
--- a/lib/rb/ext/constants.h
+++ b/lib/rb/ext/constants.h
@@ -77,7 +77,7 @@ extern ID read_all_method_id;
 extern ID read_into_buffer_method_id;
 extern ID native_qmark_method_id;
 extern ID force_binary_encoding_id;
-extern ID convert_to_utf8_byte_buffer_id;
+extern ID convert_to_byte_buffer_id;
 extern ID convert_to_string_id;

 extern ID fields_const_id;
diff --git a/lib/rb/ext/thrift_native.c b/lib/rb/ext/thrift_native.c
index f066d6c..292ef17 100644
--- a/lib/rb/ext/thrift_native.c
+++ b/lib/rb/ext/thrift_native.c
@@ -93,7 +93,7 @@ ID read_all_method_id;
 ID read_into_buffer_method_id;
 ID native_qmark_method_id;
 ID force_binary_encoding_id;
-ID convert_to_utf8_byte_buffer_id;
+ID convert_to_byte_buffer_id;
 ID convert_to_string_id;

 // constant ids
@@ -180,7 +180,7 @@ void Init_thrift_native() {
   read_into_buffer_method_id = rb_intern("read_into_buffer");
   native_qmark_method_id = rb_intern("native?");
   force_binary_encoding_id = rb_intern("force_binary_encoding");
-  convert_to_utf8_byte_buffer_id = rb_intern("convert_to_utf8_byte_buffer");
+  convert_to_byte_buffer_id = rb_intern("convert_to_byte_buffer");
   convert_to_string_id = rb_intern("convert_to_string");

   // constant ids
diff --git a/lib/rb/lib/thrift/bytes.rb b/lib/rb/lib/thrift/bytes.rb
index efd4f64..b10d966 100644
--- a/lib/rb/lib/thrift/bytes.rb
+++ b/lib/rb/lib/thrift/bytes.rb
@@ -69,21 +69,27 @@ module Thrift
         string.setbyte(index, byte)
       end

-      # Converts the given String to a UTF-8 byte buffer.
+      # Converts the given String to a UTF-8 byte buffer if the string is not
+      # already binary. If it is already binary, do nothing.
       #
       # string - The String to convert.
       #
-      # Returns a new String with BINARY encoding, containing the UTF-8
-      # bytes of the original string.
-      def self.convert_to_utf8_byte_buffer(string)
-        if string.encoding != Encoding::UTF_8
-          # transcode to UTF-8
-          string = string.encode(Encoding::UTF_8)
+      # Returns a String with BINARY encoding (either a new string containing
+      # the UTF-8 bytes of the original string or the old string if it was
+      # already binary)
+      def self.convert_to_byte_buffer(string)
+        if string.encoding != Encoding::BINARY
+          if string.encoding != Encoding::UTF_8
+            # transcode to UTF-8
+            string = string.encode(Encoding::UTF_8)
+          else
+            # encoding is already UTF-8, but a duplicate is needed
+            string = string.dup
+          end
+          string.force_encoding(Encoding::BINARY)
         else
-          # encoding is already UTF-8, but a duplicate is needed
-          string = string.dup
+          string
         end
-        string.force_encoding(Encoding::BINARY)
       end

       # Converts the given UTF-8 byte buffer into a String
@@ -116,14 +122,14 @@ module Thrift
         string[index] = byte
       end

-      def self.convert_to_utf8_byte_buffer(string)
+      def self.convert_to_byte_buffer(string)
         # This assumes $KCODE is 'UTF8'/'U', which would mean the String is 
already a UTF-8 byte buffer
         # TODO consider handling other $KCODE values and transcoding with iconv
         string
       end

       def self.convert_to_string(utf8_buffer)
-        # See comment in 'convert_to_utf8_byte_buffer' for relevant 
assumptions.
+        # See comment in 'convert_to_byte_buffer' for relevant assumptions.
         utf8_buffer
       end
     end
diff --git a/lib/rb/lib/thrift/protocol/binary_protocol.rb 
b/lib/rb/lib/thrift/protocol/binary_protocol.rb
index 2528276..d227bdd 100644
--- a/lib/rb/lib/thrift/protocol/binary_protocol.rb
+++ b/lib/rb/lib/thrift/protocol/binary_protocol.rb
@@ -107,7 +107,7 @@ module Thrift
     end

     def write_string(str)
-      str = Bytes.convert_to_utf8_byte_buffer(str)
+      str = Bytes.convert_to_byte_buffer(str)
       write_i32(str.length)
       trans.write(str)
     end
diff --git a/lib/rb/lib/thrift/protocol/compact_protocol.rb 
b/lib/rb/lib/thrift/protocol/compact_protocol.rb
index 758e1ae..c813561 100644
--- a/lib/rb/lib/thrift/protocol/compact_protocol.rb
+++ b/lib/rb/lib/thrift/protocol/compact_protocol.rb
@@ -210,7 +210,7 @@ module Thrift
     end

     def write_string(str)
-      str = Bytes.convert_to_utf8_byte_buffer(str)
+      str = Bytes.convert_to_byte_buffer(str)
       write_varint32(str.length)
       @trans.write(str)
     end
diff --git a/lib/rb/spec/bytes_spec.rb b/lib/rb/spec/bytes_spec.rb
index b82e304..863c1bb 100644
--- a/lib/rb/spec/bytes_spec.rb
+++ b/lib/rb/spec/bytes_spec.rb
@@ -64,12 +64,12 @@ describe Thrift::Bytes do
       end
     end

-    describe '.convert_to_utf8_byte_buffer' do
+    describe '.convert_to_byte_buffer' do
       it 'should convert UTF-8 String to byte buffer' do
         e = "\u20AC".encode('UTF-8') # a string with euro sign character U+20AC
         e.length.should == 1

-        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
+        a = Thrift::Bytes.convert_to_byte_buffer e
         a.encoding.should == Encoding::BINARY
         a.length.should == 3
         a.unpack('C*').should == [0xE2, 0x82, 0xAC]
@@ -81,7 +81,7 @@ describe Thrift::Bytes do
         e.length.should == 1
         e.unpack('C*').should == [0xA4] # euro sign is a different code point 
in ISO-8859-15

-        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
+        a = Thrift::Bytes.convert_to_byte_buffer e
         a.encoding.should == Encoding::BINARY
         a.length.should == 3
         a.unpack('C*').should == [0xE2, 0x82, 0xAC]
@@ -139,10 +139,10 @@ describe Thrift::Bytes do
       end
     end

-    describe '.convert_to_utf8_byte_buffer' do
+    describe '.convert_to_byte_buffer' do
       it 'should be a no-op' do
         e = 'STRING'
-        a = Thrift::Bytes.convert_to_utf8_byte_buffer e
+        a = Thrift::Bytes.convert_to_byte_buffer e
         a.should == e
         a.should be(e)
       end
{noformat}

Basically, the method convert_to_utf8_byte_buffer() assumes that all Strings it 
gets passed are just sequences of characters, which is wrong. Strings which are 
sequences of bytes should have encoding Encoding::BINARY. If this is the case, 
then the method convert_to_utf8_byte_buffer() should not perform any 
conversion, because the string is already in the desired format. The patch 
achieves that. However, the naming of the method is also not correct anymore 
then, thus, it is renamed to make it clear that the result of this method is 
not always a UTF-8 encoded string. (But it also does not need to be one).

> Ruby-1.9: data loss: "binary" fields are re-encoded
> ---------------------------------------------------
>
>                 Key: THRIFT-1727
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1727
>             Project: Thrift
>          Issue Type: Bug
>          Components: Ruby - Library
>    Affects Versions: 0.9
>         Environment: JRuby 1.6.8 using "--1.9" command line parameter.
>            Reporter: XB
>
> When setting a binary field of a Thrift object with some binary data (e.g. a 
> string whose encoding is "ASCII-8BIT") and then serializing this object, the 
> binary data is re-encoded. That is, it is encoded as if it were not a 
> sequence of bytes but a sequence of characters, encoded using the ISO-8859-1 
> encoding. This assumed ISO-8859-1 sequence of characters is then converted 
> into UTF-8 (by BinaryProtocol or CompactProtocol). This basically means that 
> all bytes whose values are between 0x80 (inclusive) and 0x100 (exclusive) are 
> converted into multi-byte sequences. This leads to data corruption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

Reply via email to