Oh, I thought you were talking about it in reference to the Ruby lib only. On Oct 18, 2012, at 1:59 AM, Jens Geyer <[email protected]> wrote:
> It's not to sat that binary is not supported at all. It's about the > inconsistencies. > > ________________________________ > Von: Jens Geyer > Gesendet: 18.10.2012 08:59 > An: [email protected] > Betreff: AW: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" > fields are re-encoded > > Hi Nathan, > > I referred to the binary data type. > > This one is in a semi-defined state, and IIRC there exist slightly different > ways how binary fields are handled throughout the various languages. A ticket > with some discussion already exists, can't remember the number right now. > > The web site says ( or did say in the past) something to the effect that that > binary is intended to become a fully supported real data type some day in the > future. > > Greetings, > Jens > ________________________________ > Von: Nathan Beyer > Gesendet: 18.10.2012 03:13 > An: [email protected] > Betreff: Re: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" > fields are re-encoded > > Jens, Can you elaborate on the clean up that you'd like to see? I > missed your previous comments, but would like to read them if you can > put them up again or post some links. > > On Fri, Oct 12, 2012 at 2:03 AM, Jens Geyer <[email protected]> wrote: >> Again I would propose to clean that up in general. Same confusion happens at >> other places too. >> >> $0,02 >> Jens >> ________________________________ >> Von: XB (JIRA) >> Gesendet: 12.10.2012 00:15 >> An: [email protected] >> Betreff: [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" >> fields are re-encoded >> >> >> [ >> https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474556#comment-13474556 >> ] >> >> XB commented on THRIFT-1727: >> ---------------------------- >> >> The problem lies within the confusion between thrift "binary" fields and >> thrift "string" fields. Typically, "string" fields (sequences of characters) >> are reduced to "binary" fields (sequence of bytes). However, in the current >> Thrift Ruby implementation, the converse is done: sequences of bytes are >> reduced to sequences of characters (represented by the Ruby String type). >> The problem is that the higher layers do the reduction from sequences of >> bytes to sequences of characters, but the lower layers do not recognize >> sequences of bytes at all anymore, and just know sequences of characters. >> Then, these sequences of characters are in turn converted into sequences of >> bytes (for the wire-format). >> >> This patch solves the problem: >> >> {noformat} >> diff --git a/lib/rb/ext/binary_protocol_accelerated.c >> b/lib/rb/ext/binary_protocol_accelerated.c >> index a8ebe7f..b75acf5 100644 >> --- a/lib/rb/ext/binary_protocol_accelerated.c >> +++ b/lib/rb/ext/binary_protocol_accelerated.c >> @@ -81,7 +81,7 @@ static void write_string_direct(VALUE trans, VALUE str) { >> if (TYPE(str) != T_STRING) { >> rb_raise(rb_eStandardError, "Value should be a string"); >> } >> - str = convert_to_utf8_byte_buffer(str); >> + str = convert_to_byte_buffer(str); >> write_i32_direct(trans, RSTRING_LEN(str)); >> rb_funcall(trans, write_method_id, 1, str); >> } >> diff --git a/lib/rb/ext/bytes.c b/lib/rb/ext/bytes.c >> index 8a6fac4..d767cf4 100644 >> --- a/lib/rb/ext/bytes.c >> +++ b/lib/rb/ext/bytes.c >> @@ -27,8 +27,8 @@ VALUE force_binary_encoding(VALUE buffer) { >> return rb_funcall(thrift_bytes_module, force_binary_encoding_id, 1, >> buffer); >> } >> >> -VALUE convert_to_utf8_byte_buffer(VALUE string) { >> - return rb_funcall(thrift_bytes_module, convert_to_utf8_byte_buffer_id, 1, >> string); >> +VALUE convert_to_byte_buffer(VALUE string) { >> + return rb_funcall(thrift_bytes_module, convert_to_byte_buffer_id, 1, >> string); >> } >> >> VALUE convert_to_string(VALUE utf8_buffer) { >> diff --git a/lib/rb/ext/bytes.h b/lib/rb/ext/bytes.h >> index 7108d83..a74b2c4 100644 >> --- a/lib/rb/ext/bytes.h >> +++ b/lib/rb/ext/bytes.h >> @@ -27,5 +27,5 @@ >> */ >> >> VALUE force_binary_encoding(VALUE buffer); >> -VALUE convert_to_utf8_byte_buffer(VALUE string); >> +VALUE convert_to_byte_buffer(VALUE string); >> VALUE convert_to_string(VALUE utf8_buffer); >> diff --git a/lib/rb/ext/compact_protocol.c b/lib/rb/ext/compact_protocol.c >> index 0c05481..513c379 100644 >> --- a/lib/rb/ext/compact_protocol.c >> +++ b/lib/rb/ext/compact_protocol.c >> @@ -306,7 +306,7 @@ VALUE rb_thrift_compact_proto_write_double(VALUE self, >> VALUE dub) { >> >> VALUE rb_thrift_compact_proto_write_string(VALUE self, VALUE str) { >> VALUE transport = GET_TRANSPORT(self); >> - str = convert_to_utf8_byte_buffer(str); >> + str = convert_to_byte_buffer(str); >> write_varint32(transport, RSTRING_LEN(str)); >> WRITE(transport, RSTRING_PTR(str), RSTRING_LEN(str)); >> return Qnil; >> diff --git a/lib/rb/ext/constants.h b/lib/rb/ext/constants.h >> index 3bfac88..499734c 100644 >> --- a/lib/rb/ext/constants.h >> +++ b/lib/rb/ext/constants.h >> @@ -77,7 +77,7 @@ extern ID read_all_method_id; >> extern ID read_into_buffer_method_id; >> extern ID native_qmark_method_id; >> extern ID force_binary_encoding_id; >> -extern ID convert_to_utf8_byte_buffer_id; >> +extern ID convert_to_byte_buffer_id; >> extern ID convert_to_string_id; >> >> extern ID fields_const_id; >> diff --git a/lib/rb/ext/thrift_native.c b/lib/rb/ext/thrift_native.c >> index f066d6c..292ef17 100644 >> --- a/lib/rb/ext/thrift_native.c >> +++ b/lib/rb/ext/thrift_native.c >> @@ -93,7 +93,7 @@ ID read_all_method_id; >> ID read_into_buffer_method_id; >> ID native_qmark_method_id; >> ID force_binary_encoding_id; >> -ID convert_to_utf8_byte_buffer_id; >> +ID convert_to_byte_buffer_id; >> ID convert_to_string_id; >> >> // constant ids >> @@ -180,7 +180,7 @@ void Init_thrift_native() { >> read_into_buffer_method_id = rb_intern("read_into_buffer"); >> native_qmark_method_id = rb_intern("native?"); >> force_binary_encoding_id = rb_intern("force_binary_encoding"); >> - convert_to_utf8_byte_buffer_id = rb_intern("convert_to_utf8_byte_buffer"); >> + convert_to_byte_buffer_id = rb_intern("convert_to_byte_buffer"); >> convert_to_string_id = rb_intern("convert_to_string"); >> >> // constant ids >> diff --git a/lib/rb/lib/thrift/bytes.rb b/lib/rb/lib/thrift/bytes.rb >> index efd4f64..b10d966 100644 >> --- a/lib/rb/lib/thrift/bytes.rb >> +++ b/lib/rb/lib/thrift/bytes.rb >> @@ -69,21 +69,27 @@ module Thrift >> string.setbyte(index, byte) >> end >> >> - # Converts the given String to a UTF-8 byte buffer. >> + # Converts the given String to a UTF-8 byte buffer if the string is >> not >> + # already binary. If it is already binary, do nothing. >> # >> # string - The String to convert. >> # >> - # Returns a new String with BINARY encoding, containing the UTF-8 >> - # bytes of the original string. >> - def self.convert_to_utf8_byte_buffer(string) >> - if string.encoding != Encoding::UTF_8 >> - # transcode to UTF-8 >> - string = string.encode(Encoding::UTF_8) >> + # Returns a String with BINARY encoding (either a new string >> containing >> + # the UTF-8 bytes of the original string or the old string if it was >> + # already binary) >> + def self.convert_to_byte_buffer(string) >> + if string.encoding != Encoding::BINARY >> + if string.encoding != Encoding::UTF_8 >> + # transcode to UTF-8 >> + string = string.encode(Encoding::UTF_8) >> + else >> + # encoding is already UTF-8, but a duplicate is needed >> + string = string.dup >> + end >> + string.force_encoding(Encoding::BINARY) >> else >> - # encoding is already UTF-8, but a duplicate is needed >> - string = string.dup >> + string >> end >> - string.force_encoding(Encoding::BINARY) >> end >> >> # Converts the given UTF-8 byte buffer into a String >> @@ -116,14 +122,14 @@ module Thrift >> string[index] = byte >> end >> >> - def self.convert_to_utf8_byte_buffer(string) >> + def self.convert_to_byte_buffer(string) >> # This assumes $KCODE is 'UTF8'/'U', which would mean the String is >> already a UTF-8 byte buffer >> # TODO consider handling other $KCODE values and transcoding with >> iconv >> string >> end >> >> def self.convert_to_string(utf8_buffer) >> - # See comment in 'convert_to_utf8_byte_buffer' for relevant >> assumptions. >> + # See comment in 'convert_to_byte_buffer' for relevant assumptions. >> utf8_buffer >> end >> end >> diff --git a/lib/rb/lib/thrift/protocol/binary_protocol.rb >> b/lib/rb/lib/thrift/protocol/binary_protocol.rb >> index 2528276..d227bdd 100644 >> --- a/lib/rb/lib/thrift/protocol/binary_protocol.rb >> +++ b/lib/rb/lib/thrift/protocol/binary_protocol.rb >> @@ -107,7 +107,7 @@ module Thrift >> end >> >> def write_string(str) >> - str = Bytes.convert_to_utf8_byte_buffer(str) >> + str = Bytes.convert_to_byte_buffer(str) >> write_i32(str.length) >> trans.write(str) >> end >> diff --git a/lib/rb/lib/thrift/protocol/compact_protocol.rb >> b/lib/rb/lib/thrift/protocol/compact_protocol.rb >> index 758e1ae..c813561 100644 >> --- a/lib/rb/lib/thrift/protocol/compact_protocol.rb >> +++ b/lib/rb/lib/thrift/protocol/compact_protocol.rb >> @@ -210,7 +210,7 @@ module Thrift >> end >> >> def write_string(str) >> - str = Bytes.convert_to_utf8_byte_buffer(str) >> + str = Bytes.convert_to_byte_buffer(str) >> write_varint32(str.length) >> @trans.write(str) >> end >> diff --git a/lib/rb/spec/bytes_spec.rb b/lib/rb/spec/bytes_spec.rb >> index b82e304..863c1bb 100644 >> --- a/lib/rb/spec/bytes_spec.rb >> +++ b/lib/rb/spec/bytes_spec.rb >> @@ -64,12 +64,12 @@ describe Thrift::Bytes do >> end >> end >> >> - describe '.convert_to_utf8_byte_buffer' do >> + describe '.convert_to_byte_buffer' do >> it 'should convert UTF-8 String to byte buffer' do >> e = "\u20AC".encode('UTF-8') # a string with euro sign character >> U+20AC >> e.length.should == 1 >> >> - a = Thrift::Bytes.convert_to_utf8_byte_buffer e >> + a = Thrift::Bytes.convert_to_byte_buffer e >> a.encoding.should == Encoding::BINARY >> a.length.should == 3 >> a.unpack('C*').should == [0xE2, 0x82, 0xAC] >> @@ -81,7 +81,7 @@ describe Thrift::Bytes do >> e.length.should == 1 >> e.unpack('C*').should == [0xA4] # euro sign is a different code >> point in ISO-8859-15 >> >> - a = Thrift::Bytes.convert_to_utf8_byte_buffer e >> + a = Thrift::Bytes.convert_to_byte_buffer e >> a.encoding.should == Encoding::BINARY >> a.length.should == 3 >> a.unpack('C*').should == [0xE2, 0x82, 0xAC] >> @@ -139,10 +139,10 @@ describe Thrift::Bytes do >> end >> end >> >> - describe '.convert_to_utf8_byte_buffer' do >> + describe '.convert_to_byte_buffer' do >> it 'should be a no-op' do >> e = 'STRING' >> - a = Thrift::Bytes.convert_to_utf8_byte_buffer e >> + a = Thrift::Bytes.convert_to_byte_buffer e >> a.should == e >> a.should be(e) >> end >> {noformat} >> >> Basically, the method convert_to_utf8_byte_buffer() assumes that all Strings >> it gets passed are just sequences of characters, which is wrong. Strings >> which are sequences of bytes should have encoding Encoding::BINARY. If this >> is the case, then the method convert_to_utf8_byte_buffer() should not >> perform any conversion, because the string is already in the desired format. >> The patch achieves that. However, the naming of the method is also not >> correct anymore then, thus, it is renamed to make it clear that the result >> of this method is not always a UTF-8 encoded string. (But it also does not >> need to be one). >> >>> Ruby-1.9: data loss: "binary" fields are re-encoded >>> --------------------------------------------------- >>> >>> Key: THRIFT-1727 >>> URL: https://issues.apache.org/jira/browse/THRIFT-1727 >>> Project: Thrift >>> Issue Type: Bug >>> Components: Ruby - Library >>> Affects Versions: 0.9 >>> Environment: JRuby 1.6.8 using "--1.9" command line parameter. >>> Reporter: XB >>> >>> When setting a binary field of a Thrift object with some binary data (e.g. >>> a string whose encoding is "ASCII-8BIT") and then serializing this object, >>> the binary data is re-encoded. That is, it is encoded as if it were not a >>> sequence of bytes but a sequence of characters, encoded using the >>> ISO-8859-1 encoding. This assumed ISO-8859-1 sequence of characters is then >>> converted into UTF-8 (by BinaryProtocol or CompactProtocol). This basically >>> means that all bytes whose values are between 0x80 (inclusive) and 0x100 >>> (exclusive) are converted into multi-byte sequences. This leads to data >>> corruption. >> >> -- >> This message is automatically generated by JIRA. >> If you think it was sent incorrectly, please contact your JIRA administrators >> For more information on JIRA, see: http://www.atlassian.com/software/jira
