[ https://issues.apache.org/jira/browse/ARROW-17473 ]


    Sasha Sirovica deleted comment on ARROW-17473:
    ----------------------------------------

was (Author: JIRAUSER294638):
cc [~zeroshade] I see you've been doing some great work in this area! You might 
find this interesting.

 

I have been looking into this issue myself, but so far I have not found the 
root cause.

> [Go] String Binary Builder Leaks Memory When Writing to Parquet
> ---------------------------------------------------------------
>
>                 Key: ARROW-17473
>                 URL: https://issues.apache.org/jira/browse/ARROW-17473
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Go
>    Affects Versions: 9.0.0
>         Environment: Mac
>            Reporter: Sasha Sirovica
>            Priority: Major
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to Parquet, the program's memory 
> usage increases continuously.
>  
> I took a heap dump on my computer midway through the program, and the 
> majority of allocations come from `StringBuilder.Append`. Memory usage 
> approached 16GB of RAM before I terminated the program.
>  
> I was not able to replicate this behavior with just `arrow.PrimitiveTypes`. 
> Another interesting point: if the records are created but never written with 
> pqarrow, there is also no memory leak. In the program below, commenting out 
> `w.Write(rec)` makes the memory growth disappear.
>  
> Example program which causes memory to leak:
> {code:java}
> package main
> import (
>    "log"
>    "os"
>
>    "github.com/apache/arrow/go/v9/arrow"
>    "github.com/apache/arrow/go/v9/arrow/array"
>    "github.com/apache/arrow/go/v9/arrow/memory"
>    "github.com/apache/arrow/go/v9/parquet"
>    "github.com/apache/arrow/go/v9/parquet/compress"
>    "github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> func main() {
>    f, err := os.Create("/tmp/test.parquet")
>    if err != nil {
>       log.Fatal(err)
>    }
>    arrowProps := pqarrow.DefaultWriterProps()
>    // Single-column schema with one UTF-8 string field
>    schema := arrow.NewSchema(
>       []arrow.Field{
>          {Name: "aString", Type: arrow.BinaryTypes.String},
>       },
>       nil,
>    )
>    writerProps := parquet.NewWriterProperties(
>       parquet.WithCompression(compress.Codecs.Snappy))
>    w, err := pqarrow.NewFileWriter(schema, f, writerProps, arrowProps)
>    if err != nil {
>       log.Fatal(err)
>    }
>    builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>    defer builder.Release()
>    for i := 1; i < 50000000; i++ {
>       builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
>       if i%2000000 == 0 {
>          // Write a row group out every 2M rows
>          rec := builder.NewRecord()
>          if err := w.Write(rec); err != nil {
>             log.Fatal(err)
>          }
>          rec.Release()
>       }
>    }
>    if err := w.Close(); err != nil {
>       log.Fatal(err)
>    }
> }{code}
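
For anyone trying to reproduce the heap dump mentioned in the description, the measurement can be done with the Go standard library alone. The following is only a minimal sketch, not the reporter's code: `heapInUse` and `heapProfile` are hypothetical helper names, and the `retained` slice merely stands in for whatever memory stays reachable between writes. It also assumes the Arrow allocator in use is backed by the Go heap, so that Go's own heap statistics reflect it.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"runtime"
	"runtime/pprof"
)

// heapInUse forces a garbage collection and then reports how many bytes
// of heap spans are still in use.
func heapInUse() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

// heapProfile returns a heap profile in the format that
// `go tool pprof` reads.
func heapProfile() ([]byte, error) {
	runtime.GC() // refresh allocation statistics before profiling
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Stand-in for memory that remains reachable after each batch.
	var retained [][]byte
	for batch := 1; batch <= 3; batch++ {
		retained = append(retained, make([]byte, 8<<20))
		fmt.Printf("batch %d: heap in use %d MiB\n", batch, heapInUse()>>20)
	}

	prof, err := heapProfile()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("heap profile: %d bytes (%d buffers retained)\n",
		len(prof), len(retained))
}
```

In the reproduction program, calling a helper like `heapInUse` right after each `rec.Release()` would show whether the in-use heap keeps growing from one batch to the next, and the profile bytes can be saved to a file and inspected with `go tool pprof`.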



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
