Sasha Sirovica created ARROW-17573:
--------------------------------------
Summary: [Go] String Binary Builder Leaks Memory When Writing to Parquet
Key: ARROW-17573
URL: https://issues.apache.org/jira/browse/ARROW-17573
Project: Apache Arrow
Issue Type: Bug
Components: Go
Affects Versions: 9.0.0
Reporter: Sasha Sirovica
When using `arrow.BinaryTypes.String` in a schema, appending multiple strings,
and then writing records out to Parquet, the program's memory usage increases
continuously. This also applies to the other `arrow.BinaryTypes`.
I took a heap dump midway through the program; the majority of allocations
come from `StringBuilder.Append` and are never GC'd. I approached 16GB of RAM
before terminating the program.
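For anyone wanting to repeat the analysis, a heap profile can be captured mid-run along these lines (a minimal sketch using the standard `runtime/pprof` package; the helper name and path are illustrative, not the exact code I used):
{code:go}
import (
	"os"
	"runtime/pprof"
)

// dumpHeap writes the current heap profile to path; inspect it with
// `go tool pprof <binary> <path>` to see where live allocations originate.
// (Illustrative helper, not part of the repro below.)
func dumpHeap(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.WriteHeapProfile(f)
}
{code}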
I was not able to replicate this behavior with just `arrow.PrimitiveTypes`.
Another interesting point: if the records are created but never written with
pqarrow, memory does not grow. In the program below, commenting out
`w.Write(rec)` avoids the memory growth.
Example program that reproduces the leak:
{code:go}
package main

import (
	"os"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
	"github.com/apache/arrow/go/v9/parquet"
	"github.com/apache/arrow/go/v9/parquet/compress"
	"github.com/apache/arrow/go/v9/parquet/pqarrow"
)

func main() {
	f, _ := os.Create("/tmp/test.parquet")
	arrowProps := pqarrow.DefaultWriterProps()
	schema := arrow.NewSchema(
		[]arrow.Field{
			{Name: "aString", Type: arrow.BinaryTypes.String},
		},
		nil,
	)
	w, _ := pqarrow.NewFileWriter(schema, f,
		parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)),
		arrowProps)
	builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	for i := 1; i < 50000000; i++ {
		builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
		if i%2000000 == 0 {
			// Write a row group out every 2M appends.
			rec := builder.NewRecord()
			w.Write(rec)
			rec.Release()
		}
	}
	w.Close()
}{code}
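As a side note, one way to check whether the retained bytes are even tracked by the Arrow allocator is to wrap it in `memory.CheckedAllocator` and watch `CurrentAlloc()`. Below is a minimal sketch of the no-write case (loop bounds and the print are illustrative); if the leak is in allocator-tracked buffers, the repro above with `w.Write(rec)` in the loop should show this number climbing instead of staying near zero:
{code:go}
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
	// CheckedAllocator tracks live bytes handed out by the wrapped allocator.
	mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
	schema := arrow.NewSchema(
		[]arrow.Field{{Name: "aString", Type: arrow.BinaryTypes.String}},
		nil,
	)
	builder := array.NewRecordBuilder(mem, schema)
	defer builder.Release()

	for i := 1; i <= 4000000; i++ {
		builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
		if i%2000000 == 0 {
			rec := builder.NewRecord()
			rec.Release()
			// With no parquet writer involved, live bytes should return
			// to (near) zero after each release.
			fmt.Printf("live bytes after release: %d\n", mem.CurrentAlloc())
		}
	}
}
{code}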
--
This message was sent by Atlassian Jira
(v8.20.10#820010)