Jackie Murphy created AVRO-1637:
-----------------------------------
Summary: Handling multibyte UTF-8 characters in Ruby
Key: AVRO-1637
URL: https://issues.apache.org/jira/browse/AVRO-1637
Project: Avro
Issue Type: Bug
Reporter: Jackie Murphy
Priority: Minor
It looks like the Ruby implementation of Avro doesn't successfully round-trip
UTF-8 encoded strings containing multibyte characters.
Example:
{code}
require 'avro'
require 'stringio'

# Serialize obj to Avro binary using the given schema.
def serialize(obj, schema)
  buffer = StringIO.new
  encoder = Avro::IO::BinaryEncoder.new(buffer)
  datum_writer = Avro::IO::DatumWriter.new(schema)
  datum_writer.write(obj, encoder)
  buffer.seek(0)
  buffer.read
end

# Deserialize Avro binary back into a Ruby object using the same schema.
def deserialize(avro_obj, schema)
  reader = StringIO.new(avro_obj)
  decoder = Avro::IO::BinaryDecoder.new(reader)
  datum_reader = Avro::IO::DatumReader.new(schema)
  datum_reader.read(decoder)
end
{code}
{code}
> schema =
> Avro::Schema.parse("{\"type\":\"record\",\"name\":\"Example\",\"fields\":[{\"name\":\"example_field\",\"type\":\"string\"},
> {\"name\":\"other_field\",\"type\":\"string\"}]}")
> deserialize(serialize({'example_field'=> 'héllö world',
> 'other_field'=>'goodbye world'}, schema), schema)
{"example_field"=>"h\xC3\xA9ll\xC3\xB6 wor", "other_field"=>"d\x1Agoodbye world"}
{code}
Note that the length of the first field appears to be computed incorrectly
(as the string's length in characters rather than in bytes?), so the end of
the first field spills into the second field.
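To illustrate the suspected cause, here is a minimal sketch in plain Ruby (no Avro calls; the variable names are illustrative only). It assumes the writer prefixes the string with its character count, which has not been confirmed against the Avro source. The string has 11 characters but 13 UTF-8 bytes, and keeping only the first 11 bytes gives the same truncated value as in the output above:

```ruby
# 'héllö world', written with \u escapes so the source file encoding does not matter.
s = "h\u00e9ll\u00f6 world"

chars = s.length     # number of characters
bytes = s.bytesize   # number of bytes in the UTF-8 encoding

# Assumption: only `chars` bytes are consumed for the field.
truncated = s.b[0, chars]    # bytes that would be read as the first field
leftover  = s.b[chars..-1]   # bytes that would spill into whatever is read next

puts "characters: #{chars}, bytes: #{bytes}"
puts "truncated:  #{truncated.inspect}"
puts "leftover:   #{leftover.inspect}"
```

Under that assumption the two leftover bytes are consumed as the start of the next field, which is consistent with the second field above beginning with "d".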
Also, if the bytes happen to line up especially unluckily, we can get an
{{ArgumentError}}:
{code}
> deserialize(serialize({'example_field'=> '‘hello’ world',
> 'other_field'=>'goodbye world'}, schema), schema)
ArgumentError: negative length -56 given
{code}
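The negative length can be reproduced by hand under the same unconfirmed assumption. Avro encodes a string's length as a zigzag-encoded variable-length long. If only 13 bytes (the character count) of the 17-byte first field are consumed, the next byte treated as a length prefix is the "o" of "world". A sketch, where zigzag_decode is a hypothetical helper that handles only single-byte values and is not part of the Avro API:

```ruby
# Hypothetical helper: zigzag-decode a one-byte Avro varint (value below 0x80).
def zigzag_decode(n)
  (n >> 1) ^ -(n & 1)
end

# '‘hello’ world', written with \u escapes so the source file encoding does not matter.
s = "\u2018hello\u2019 world"

# Assumption: the reader consumes s.length bytes, so the byte at that offset
# is the first one left unread and gets interpreted as the next length prefix.
next_byte = s.b.getbyte(s.length)
misread_length = zigzag_decode(next_byte)

puts "characters: #{s.length}, bytes: #{s.bytesize}"
puts "next byte: 0x#{next_byte.to_s(16)} (#{next_byte.chr.inspect})"
puts "misread length: #{misread_length}"   # prints -56
```

An odd byte value always zigzag-decodes to a negative number, so whether the reader raises or silently returns shifted data depends on which byte happens to follow the truncated field.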
This looks similar to a previous issue with the Perl implementation in AVRO-1517.