[
https://issues.apache.org/jira/browse/AVRO-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lucas Heimberg updated AVRO-3005:
---------------------------------
Description:
Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e.
when the StackallocThreshold is exceeded.
This can be seen when serializing and subsequently deserializing a
GenericRecord of schema
{code:java}
{
  "type": "record",
  "name": "Foo",
  "fields": [
    { "name": "x", "type": "string" }
  ]
}{code}
with a field x containing a string of length > 256, as done in the test case:
{code:java}
public void Test()
{
    var schema = (RecordSchema) Schema.Parse(
        "{ \"type\":\"record\", \"name\":\"Foo\", \"fields\":[{ \"name\":\"x\", \"type\":\"string\" }]}");

    var datum = new GenericRecord(schema);
    datum.Add("x", new String('x', 257));

    byte[] serialized;
    using (var ms = new MemoryStream())
    {
        var enc = new BinaryEncoder(ms);
        var writer = new GenericDatumWriter<GenericRecord>(schema);
        writer.Write(datum, enc);
        serialized = ms.ToArray();
    }

    using (var ms = new MemoryStream(serialized))
    {
        var dec = new BinaryDecoder(ms);
        var deserialized = new GenericRecord(schema);
        var reader = new GenericDatumReader<GenericRecord>(schema, schema);
        reader.Read(deserialized, dec);
        Assert.Equal(datum, deserialized);
    }
}{code}
which yields the following exception:
{code:java}
Avro.AvroException
End of stream reached
   at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
   at Avro.IO.BinaryDecoder.ReadString()
   at Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object r, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object rec, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object r, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
   at AvroTests.AvroTests.Test(Int32 n) in C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
{code}
The reason seems to be the following: when a string of length <= StackallocThreshold (= 256) is read, the buffer used to read the string's content from the stream is allocated on the stack with exactly the length of the string. If the length is > StackallocThreshold, the buffer is obtained from ArrayPool<byte>.Shared.Rent(length), which returns a buffer of *minimum* length 'length', but possibly a larger one.
The Read(Span<byte>) method always tries to read as many bytes from the input stream as the buffer is long, and in particular fails with the exception shown above when the stream no longer has enough data. Thus, if the expected string length is > StackallocThreshold, the Read method will either throw the above AvroException (when the string is the last element in the stream) or consume parts of the following data items in the stream; in either case the decoded data is corrupted.
The provided patch turns the byte array returned by the ArrayPool into a Span<byte> with the correct length using the Slice method, instead of casting it implicitly to a Span<byte> that covers the whole rented buffer.
Possibly related:
[https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083]
was:
Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e.
when the StackallocThreshold is exceeded.
This can be seen when serializing and subsequently deserializing a
GenericRecord of schema
{code:java}
{
"type": "record",
"name": "Foo",
"fields": [
{ "name": "x", "type": "string" }
]
}{code}
with a field x containing a string of length > 256, as done in the test case:
{code:java}
public void Test()
{
var schema = (RecordSchema) Schema.Parse("{ \"type\":\"record\",
\"name\":\"Foo\",\"fields\":[{\"name\":\"x\",\"type\":\"string\"}]}");
var datum = new GenericRecord(schema);
datum.Add("x", new String('x', 257));
byte[] serialized;
using (var ms = new MemoryStream())
{
var enc = new BinaryEncoder(ms);
var writer = new GenericDatumWriter<GenericRecord>(schema);
writer.Write(datum, enc);
serialized = ms.ToArray();
}
using (var ms = new MemoryStream(serialized))
{
var dec = new BinaryDecoder(ms);
var deserialized = new GenericRecord(schema);
var reader = new GenericDatumReader<GenericRecord>(schema, schema);
reader.Read(deserialized, dec);
Assert.Equal(datum, deserialized);
}
}{code}
which yields the following exception
{code:java}
Avro.AvroException
End of stream reached
at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
at Avro.IO.BinaryDecoder.ReadString()
at Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder
d)
at
Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object
r, Decoder d)
at
Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object
rec, Decoder d)
at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder
decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
at
Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object
r, Decoder d)
at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
at AvroTests.AvroTests.Test(Int32 n) in
C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
{code}
The reason seems to be the following: when a string of length <=
StackallocThreshold (=256) is read, a buffer to read the content of the string
from the stream is allocated on the stack with the exact length of the string.
If the length is > StackallocThreshold, the buffer is obtained from
ArrayPool<byte>.Shared.Rent(length), which returns a buffer of *minimum* length
'length', but possibly also a larger buffer.
The Read(Span<byte>) method always tries to read as many bytes from the input
stream as this buffer has length, and in particular will fail with the
exception shown above when the stream does not have enough data anymore. Thus,
if the string has expected length > StackallocThreshold, the Read method will
either throw the above AvroException (when the string is the last element in
the stream) or will already consume parts of following data items in the
stream, in any case causing corruption.
A solution to the problem is to either ensure that the buffer always has exactly the length of the string, or to add another argument to the Read method that bounds the number of bytes read. With the latter option, the conversion from the byte buffer back to a string via Encoding.UTF8.GetString(buffer) has to be changed to Encoding.UTF8.GetString(buffer, 0, length) to ensure that only the part of the buffer into which the string was actually read is decoded.
Possibly related:
[https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083]
> Deserialization of string with > 256 characters fails
> -----------------------------------------------------
>
> Key: AVRO-3005
> URL: https://issues.apache.org/jira/browse/AVRO-3005
> Project: Apache Avro
> Issue Type: Bug
> Components: csharp
> Affects Versions: 1.10.1
> Reporter: Lucas Heimberg
> Priority: Major
> Attachments: AVRO-3005.patch
>
>
> Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e.
> when the StackallocThreshold is exceeded.
> This can be seen when serializing and subsequently deserializing a
> GenericRecord of schema
> {code:java}
> {
> "type": "record",
> "name": "Foo",
> "fields": [
> { "name": "x", "type": "string" }
> ]
> }{code}
> with a field x containing a string of length > 256, as done in the test case:
> {code:java}
> public void Test()
> {
> var schema = (RecordSchema) Schema.Parse("{ \"type\":\"record\",
> \"name\":\"Foo\",\"fields\":[{\"name\":\"x\",\"type\":\"string\"}]}");
>
> var datum = new GenericRecord(schema);
> datum.Add("x", new String('x', 257));
> byte[] serialized;
> using (var ms = new MemoryStream())
> {
> var enc = new BinaryEncoder(ms);
> var writer = new GenericDatumWriter<GenericRecord>(schema);
> writer.Write(datum, enc);
> serialized = ms.ToArray();
> }
> using (var ms = new MemoryStream(serialized))
> {
> var dec = new BinaryDecoder(ms);
> var deserialized = new GenericRecord(schema);
> var reader = new GenericDatumReader<GenericRecord>(schema, schema);
> reader.Read(deserialized, dec);
> Assert.Equal(datum, deserialized);
> }
> }{code}
> which yields the following exception
> {code:java}
> Avro.AvroException
> End of stream reached
> at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
> at Avro.IO.BinaryDecoder.ReadString()
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder d)
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object
> r, Decoder d)
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object
> rec, Decoder d)
> at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder
> decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object
> r, Decoder d)
> at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
> at AvroTests.AvroTests.Test(Int32 n) in
> C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
> {code}
> The reason seems to be the following: when a string of length <=
> StackallocThreshold (=256) is read, a buffer to read the content of the
> string from the stream is allocated on the stack with the exact length of the
> string. If the length is > StackallocThreshold, the buffer is obtained from
> ArrayPool<byte>.Shared.Rent(length), which returns a buffer of *minimum*
> length 'length', but possibly also a larger buffer.
> The Read(Span<byte>) method always tries to read as many bytes from the input
> stream as this buffer has length, and in particular will fail with the
> exception shown above when the stream does not have enough data anymore.
> Thus, if the string has expected length > StackallocThreshold, the Read
> method will either throw the above AvroException (when the string is the last
> element in the stream) or will already consume parts of following data items
> in the stream, in any case causing corruption.
> The provided patch turns the byte array returned by the ArrayPool into a Span<byte>
> with the correct length using the Slice method, instead of casting it
> implicitly to a Span<byte> that covers the whole rented buffer.
>
> Possibly related:
> [https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)