Leitzler created THRIFT-5828:
--------------------------------

             Summary: ReadBinary in the binary protocol implementation over-allocates
                 Key: THRIFT-5828
                 URL: https://issues.apache.org/jira/browse/THRIFT-5828
             Project: Thrift
          Issue Type: Bug
          Components: Go - Library
    Affects Versions: 0.21.0, 0.20.0, 0.19.0, 0.18.1, 0.18.0, 0.17.0, 0.16.0, 0.15.0, 0.14.2, 0.14.1, 0.14.0
         Environment: Benchmarked using macOS w/ M1 Ultra (arm64).
            Reporter: Leitzler


*Background*

----

Prior to 0.14.0, the Go binary protocol implementation of ReadBinary(..) allocated as much memory as the caller asked for, causing CVE-2019-11939.

This was fixed in [PR #2292|https://github.com/apache/thrift/pull/2292], where instead of allocating everything up front:
{code:go}
buf := make([]byte, isize)
_, err := io.ReadFull(p.trans, buf)
{code}
... we now use io.CopyN:
{code:go}
buf := new(bytes.Buffer)
_, err := io.CopyN(buf, trans, int64(size))
{code}
... which prevents allocating an insanely large amount of memory.

*The issue*

----

The combination of bytes.Buffer and io.CopyN, however, comes with a subtle 
caveat: it ensures that we _always_ over-allocate.

If the destination writer provided to [io.CopyN|https://pkg.go.dev/io#CopyN] implements io.ReaderFrom, which bytes.Buffer does, the copy will use that.
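The effect can be reproduced in isolation. The following sketch (independent of Thrift) copies exactly 64 KiB into a fresh bytes.Buffer via io.CopyN and shows that the backing array ends up larger than the data:
{code:go}
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	// A source with exactly 64 KiB of data.
	src := strings.NewReader(strings.Repeat("x", 64*1024))

	// Copy it into a fresh bytes.Buffer, as the current fix does via io.CopyN.
	buf := new(bytes.Buffer)
	if _, err := io.CopyN(buf, src, 64*1024); err != nil {
		panic(err)
	}

	// The buffer holds 64 KiB, but ReadFrom grew the backing array past
	// the data size (doubling growth plus the final 512-byte grow).
	fmt.Println(buf.Len(), buf.Cap() > buf.Len())
}
{code}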

Inside [ReadFrom(...) in a bytes.Buffer|https://cs.opensource.google/go/go/+/refs/tags/go1.23.2:src/bytes/buffer.go;l=200-225], the buffer is unconditionally grown before each Read(..) call so that at least 512 bytes will fit. And each time the internal slice inside a bytes.Buffer grows, it doubles in size. This causes additional allocations in the end, as shown by this benchmark test:

{code:go}
package main

import (
        "bytes"
        "crypto/rand"
        "io"
        "testing"
)

var tt = []struct {
        dataSize  int
        askedSize int32
}{
        {100, 100},
        {32 * 1024, 32 * 1024},               // 32k
        {64 * 1024, 64 * 1024},               // 64k
        {512 * 1024, 512 * 1024},             // 512k
        {1024 * 1024, 1024 * 1024},           // 1M
        {5 * 1024 * 1024, 5 * 1024 * 1024},   // 5M
        {10 * 1024 * 1024, 10 * 1024 * 1024}, // 10M
        {40 * 1024 * 1024, 40 * 1024 * 1024}, // 40M
}

func TestCurrent(t *testing.T) {
        for _, tc := range tt {
                bm := testing.Benchmark(func(b *testing.B) {
                        data := make([]byte, tc.dataSize)
                        if _, err := io.ReadFull(rand.Reader, data); err != nil {
                                b.Fatal(err)
                        }
                        b.ResetTimer()
                        for range b.N {
                                safeReadBytes(tc.askedSize, bytes.NewReader(data))
                        }
                })
                t.Logf("%s - ask: %d, data: %d", bm.MemString(), tc.askedSize, tc.dataSize)
        }
}
{code}
{code}
    main_test.go:36:     1656 B/op             5 allocs/op - ask: 100, data: 100
    main_test.go:36:   130680 B/op            11 allocs/op - ask: 32768, data: 32768
    main_test.go:36:   261753 B/op            12 allocs/op - ask: 65536, data: 65536
    main_test.go:36:  2096768 B/op            15 allocs/op - ask: 524288, data: 524288
    main_test.go:36:  4193923 B/op            16 allocs/op - ask: 1048576, data: 1048576
    main_test.go:36: 16776833 B/op            18 allocs/op - ask: 5242880, data: 5242880
    main_test.go:36: 33554060 B/op            19 allocs/op - ask: 10485760, data: 10485760
    main_test.go:36: 134217366 B/op           21 allocs/op - ask: 41943040, data: 41943040
{code}

As the actual data size grows, so does the over-allocation.

*Improvement*

----

What we could do is use the same approach that ReadString() used prior to v0.14.0 (slightly modified to ensure it behaves like the current solution):
{code:go}
func safeReadBytes(size int32, trans io.Reader) ([]byte, error) {
    if size < 0 {
        return nil, nil
    }

    const readLimit = 32 * 1024
    if size < readLimit {
        // Small payloads: a single allocation of the requested size is safe.
        b := make([]byte, size)
        n, err := io.ReadFull(trans, b)
        return b[:n], err
    }

    // Large payloads: read in readLimit-sized chunks so that a malformed
    // size field cannot trigger one huge up-front allocation.
    b := make([]byte, readLimit)
    var buf bytes.Buffer
    var e error
    var n int
    for size > 0 {
        n, e = io.ReadFull(trans, b)
        buf.Write(b[:n])
        if e != nil {
            break
        }
        size -= readLimit
        if size < readLimit && size > 0 {
            // Shrink the chunk so the final read stops at the requested size.
            b = b[:size]
        }
    }
    return buf.Bytes(), e
}
{code}
It will reduce the number of allocations and the amount of memory allocated for small payloads:
{code}
    main_test.go:44:      160 B/op             2 allocs/op - ask: 100, data: 100
    main_test.go:44:    65584 B/op             3 allocs/op - ask: 32768, data: 32768
    main_test.go:44:   131120 B/op             4 allocs/op - ask: 65536, data: 65536
    main_test.go:44:  1048628 B/op             7 allocs/op - ask: 524288, data: 524288
    main_test.go:44:  2097203 B/op             8 allocs/op - ask: 1048576, data: 1048576
    main_test.go:44: 16777267 B/op            11 allocs/op - ask: 5242880, data: 5242880
    main_test.go:44: 33554481 B/op            12 allocs/op - ask: 10485760, data: 10485760
    main_test.go:44: 134217777 B/op           14 allocs/op - ask: 41943040, data: 41943040
{code}
But as the payload size increases we still over-allocate, producing a lot of garbage for the garbage collector.
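The garbage comes from the intermediate arrays discarded while the bytes.Buffer doubles. As a rough, Thrift-independent illustration (assuming Go's current growth strategy), writing 1 MiB in 32 KiB chunks allocates noticeably more than 1 MiB in total:
{code:go}
package main

import (
	"bytes"
	"fmt"
	"runtime"
)

func main() {
	runtime.GC()
	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	// Write 1 MiB into a bytes.Buffer in 32 KiB chunks, mimicking the
	// chunked safeReadBytes loop above.
	chunk := make([]byte, 32*1024)
	var buf bytes.Buffer
	for i := 0; i < 32; i++ {
		buf.Write(chunk)
	}

	var after runtime.MemStats
	runtime.ReadMemStats(&after)

	// Cumulative allocations exceed the payload size: each doubling step
	// (32k -> 64k -> ... -> 1M) allocated a new array and discarded the old.
	total := after.TotalAlloc - before.TotalAlloc
	fmt.Println(buf.Len() == 1024*1024, total > uint64(buf.Len()))
}
{code}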

*Proposal*

----

The whole purpose of the original change was to avoid allocating a lot of memory when the message is malformed, an edge case. I propose that we optimise for the ordinary case, allowing the edge case to perform slightly worse, by using a larger read limit instead.

Ideally we would use the TConfiguration that was added after PR #2292 (in [PR #2296|https://github.com/apache/thrift/pull/2296]); for example, it already defines a default max message size of 100MB. Since the max message size is already checked before reading, ReadBinary could be changed back to always allocating the full size.
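Under that assumption, ReadBinary's body could again become a single io.ReadFull into a correctly sized slice. A minimal sketch (readBinaryBody is a hypothetical name, not the library API; it assumes size has already been validated against the configured max message size):
{code:go}
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readBinaryBody assumes size was already checked against TConfiguration's
// max message size, so a single up-front allocation is bounded and safe.
func readBinaryBody(trans io.Reader, size int32) ([]byte, error) {
	if size < 0 {
		return nil, nil
	}
	buf := make([]byte, size)
	n, err := io.ReadFull(trans, buf)
	return buf[:n], err
}

func main() {
	data := bytes.Repeat([]byte{0xAB}, 1024)
	got, err := readBinaryBody(bytes.NewReader(data), 1024)
	// One allocation, no buffer growth, no garbage for well-formed messages.
	fmt.Println(len(got), err == nil)
}
{code}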

If the max message size is considered too large, and we cannot add a new configuration option to TConfiguration (as it must be generic), I propose that we use a larger read buffer.

Using 10MB instead of 32k as the read limit in the example above would greatly improve the ordinary flow for payloads up to 10MB:
{code}
    main_test.go:44:      160 B/op             2 allocs/op - ask: 100, data: 100
    main_test.go:44:    32816 B/op             2 allocs/op - ask: 32768, data: 32768
    main_test.go:44:    65584 B/op             2 allocs/op - ask: 65536, data: 65536
    main_test.go:44:   524338 B/op             2 allocs/op - ask: 524288, data: 524288
    main_test.go:44:  1048625 B/op             2 allocs/op - ask: 1048576, data: 1048576
    main_test.go:44:  5242930 B/op             2 allocs/op - ask: 5242880, data: 5242880
    main_test.go:44: 20971574 B/op             3 allocs/op - ask: 10485760, data: 10485760
    main_test.go:44: 83886131 B/op             5 allocs/op - ask: 41943040, data: 41943040
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
