[
https://issues.apache.org/jira/browse/AVRO-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lucas Heimberg updated AVRO-3005:
---------------------------------
Description:
Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e.
when the StackallocThreshold is exceeded.
This can be seen when serializing and subsequently deserializing a
GenericRecord of schema
{code:java}
{
  "type": "record",
  "name": "Foo",
  "fields": [
    { "name": "x", "type": "string" }
  ]
}{code}
with a field x containing a string of length > 256, as done in the test case:
{code:java}
public void Test()
{
    var schema = (RecordSchema) Schema.Parse(
        "{ \"type\":\"record\", \"name\":\"Foo\", \"fields\":[{ \"name\":\"x\", \"type\":\"string\" }]}");

    var datum = new GenericRecord(schema);
    datum.Add("x", new String('x', 257));

    byte[] serialized;
    using (var ms = new MemoryStream())
    {
        var enc = new BinaryEncoder(ms);
        var writer = new GenericDatumWriter<GenericRecord>(schema);
        writer.Write(datum, enc);
        serialized = ms.ToArray();
    }

    using (var ms = new MemoryStream(serialized))
    {
        var dec = new BinaryDecoder(ms);
        var deserialized = new GenericRecord(schema);
        var reader = new GenericDatumReader<GenericRecord>(schema, schema);
        reader.Read(deserialized, dec);
        Assert.Equal(datum, deserialized);
    }
}{code}
which yields the following exception:
{code:java}
Avro.AvroException
End of stream reached
   at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
   at Avro.IO.BinaryDecoder.ReadString()
   at Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object r, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object rec, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object r, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
   at AvroTests.AvroTests.Test(Int32 n) in C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
{code}
The reason seems to be the following: when a string of length <= StackallocThreshold (= 256) is read, the buffer used to read the string's content from the stream is allocated on the stack with exactly the length of the string. If the length is > StackallocThreshold, the buffer is obtained from ArrayPool<byte>.Shared.Rent(length), which returns a buffer of *minimum* length 'length', but possibly a larger one.
The Read(Span<byte>) method always tries to read as many bytes from the input stream as the buffer is long, and in particular fails with the exception shown above when the stream no longer has enough data. Thus, if the expected string length is > StackallocThreshold, the Read method will either throw the above AvroException (when the string is the last element in the stream) or consume parts of the following data items in the stream; in either case the decoded data is corrupted.
The provided patch turns the byte array returned by the ArrayPool into a Span<byte> with the correct length using the Slice method, instead of casting it implicitly to a Span<byte> that covers the whole rented buffer.
Possibly related:
[https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083]
was:
Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e.
when the StackallocThreshold is exceeded.
This can be seen when serializing and subsequently deserializing a
GenericRecord of schema
{code:java}
{
"type": "record",
"name": "Foo",
"fields": [
{ "name": "x", "type": "string" }
]
}{code}
with a field x containing a string of length > 256, as done in the test case:
{code:java}
public void Test()
{
var schema = (RecordSchema) Schema.Parse("{ \"type\":\"record\",
\"name\":\"Foo\",\"fields\":[{\"name\":\"x\",\"type\":\"string\"}]}");
var datum = new GenericRecord(schema);
datum.Add("x", new String('x', 257));
byte[] serialized;
using (var ms = new MemoryStream())
{
var enc = new BinaryEncoder(ms);
var writer = new GenericDatumWriter<GenericRecord>(schema);
writer.Write(datum, enc);
serialized = ms.ToArray();
}
using (var ms = new MemoryStream(serialized))
{
var dec = new BinaryDecoder(ms);
var deserialized = new GenericRecord(schema);
var reader = new GenericDatumReader<GenericRecord>(schema, schema);
reader.Read(deserialized, dec);
Assert.Equal(datum, deserialized);
}
}{code}
which yields the following exception
{code:java}
Avro.AvroException
End of stream reached
at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
at Avro.IO.BinaryDecoder.ReadString()
at Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder
d)
at
Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object
r, Decoder d)
at
Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object
rec, Decoder d)
at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder
decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
at
Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object
r, Decoder d)
at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
at AvroTests.AvroTests.Test(Int32 n) in
C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
{code}
The reason seems to be the following: when a string of length <=
StackallocThreshold (=256) is read, a buffer to read the content of the string
from the stream is allocated on the stack with the exact length of the string.
If the length is > StackallocThreshold, the buffer is obtained from
ArrayPool<byte>.Shared.Rent(length), which returns a buffer of *minimum* length
'length', but possibly also a larger buffer.
The Read(Span<byte>) method always tries to read as many bytes from the input
stream as this buffer has length, and in particular will fail with the
exception shown above when the stream does not have enough data anymore. Thus,
if the string has expected length > StackallocThreshold, the Read method will
either throw the above AvroException (when the string is the last element in
the stream) or will already consume parts of following data items in the
stream, in any case causing corruption.
A solution to the problem is to either ensure that the buffer always has exactly the length of the string, or to add another argument to the Read method that bounds the number of bytes read. With the latter option, the conversion from the byte buffer back to a string via Encoding.UTF8.GetString(buffer) has to be changed to Encoding.UTF8.GetString(buffer, 0, length) to ensure that only the part of the buffer into which the string was actually read is decoded.
Possibly related:
[https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083]
> Deserialization of string with > 256 characters fails
> -----------------------------------------------------
>
> Key: AVRO-3005
> URL: https://issues.apache.org/jira/browse/AVRO-3005
> Project: Apache Avro
> Issue Type: Bug
> Components: csharp
> Affects Versions: 1.10.1
> Reporter: Lucas Heimberg
> Priority: Major
> Attachments: AVRO-3005.patch
>
>
> Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e.
> when the StackallocThreshold is exceeded.
> This can be seen when serializing and subsequently deserializing a
> GenericRecord of schema
> {code:java}
> {
> "type": "record",
> "name": "Foo",
> "fields": [
> { "name": "x", "type": "string" }
> ]
> }{code}
> with a field x containing a string of length > 256, as done in the test case:
> {code:java}
> public void Test()
> {
> var schema = (RecordSchema) Schema.Parse("{ \"type\":\"record\",
> \"name\":\"Foo\",\"fields\":[{\"name\":\"x\",\"type\":\"string\"}]}");
>
> var datum = new GenericRecord(schema);
> datum.Add("x", new String('x', 257));
> byte[] serialized;
> using (var ms = new MemoryStream())
> {
> var enc = new BinaryEncoder(ms);
> var writer = new GenericDatumWriter<GenericRecord>(schema);
> writer.Write(datum, enc);
> serialized = ms.ToArray();
> }
> using (var ms = new MemoryStream(serialized))
> {
> var dec = new BinaryDecoder(ms);
> var deserialized = new GenericRecord(schema);
> var reader = new GenericDatumReader<GenericRecord>(schema, schema);
> reader.Read(deserialized, dec);
> Assert.Equal(datum, deserialized);
> }
> }{code}
> which yields the following exception
> {code:java}
> Avro.AvroException
> End of stream reached
> at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
> at Avro.IO.BinaryDecoder.ReadString()
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder d)
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object
> r, Decoder d)
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object
> rec, Decoder d)
> at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder
> decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
> at
> Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object
> r, Decoder d)
> at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
> at AvroTests.AvroTests.Test(Int32 n) in
> C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
> {code}
> The reason seems to be the following: when a string of length <=
> StackallocThreshold (=256) is read, a buffer to read the content of the
> string from the stream is allocated on the stack with the exact length of the
> string. If the length is > StackallocThreshold, the buffer is obtained from
> ArrayPool<byte>.Shared.Rent(length), which returns a buffer of *minimum*
> length 'length', but possibly also a larger buffer.
> The Read(Span<byte>) method always tries to read as many bytes from the input
> stream as this buffer has length, and in particular will fail with the
> exception shown above when the stream does not have enough data anymore.
> Thus, if the string has expected length > StackallocThreshold, the Read
> method will either throw the above AvroException (when the string is the last
> element in the stream) or will already consume parts of following data items
> in the stream, in any case causing corruption.
> The provided patch turns the byte array returned by the ArrayPool into a Span<byte>
> with the correct length using the Slice method, instead of casting it
> implicitly to a Span<byte> that covers the whole rented buffer.
>
> Possibly related:
> [https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)