[ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasha Sirovica updated ARROW-17573:
-----------------------------------
    Description: 
When using `arrow.BinaryTypes.String` in a schema, appending multiple strings, 
and then writing a record out to Parquet, the program's memory usage grows 
continuously. This also applies to the other `arrow.BinaryTypes`.

 

I took a heap dump midway through the program, and the majority of allocations 
come from `StringBuilder.Append` and are never GC'd. Memory usage approached 
16GB of RAM before I terminated the program.

 

I was not able to replicate this behavior with just `arrow.PrimitiveTypes`. 
Another interesting point: if the records are created but never written with 
pqarrow, memory does not grow. In the program below, commenting out 
`w.Write(rec)` avoids the memory growth.

Example program which causes memory to leak:
{code:java}
package main

import (
   "os"

   "github.com/apache/arrow/go/v9/arrow"
   "github.com/apache/arrow/go/v9/arrow/array"
   "github.com/apache/arrow/go/v9/arrow/memory"
   "github.com/apache/arrow/go/v9/parquet"
   "github.com/apache/arrow/go/v9/parquet/compress"
   "github.com/apache/arrow/go/v9/parquet/pqarrow"
)

func main() {
   f, _ := os.Create("/tmp/test.parquet")

   arrowProps := pqarrow.DefaultWriterProps()
   schema := arrow.NewSchema(
      []arrow.Field{
         {Name: "aString", Type: arrow.BinaryTypes.String},
      },
      nil,
   )
   w, _ := pqarrow.NewFileWriter(
      schema,
      f,
      parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)),
      arrowProps,
   )

   builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
   for i := 1; i < 5000000000; i++ {
      builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
      if i%2000000 == 0 {
         // Write a row group out every 2M appended rows
         rec := builder.NewRecord()
         w.Write(rec)
         rec.Release()
      }
   }
   w.Close()
}{code}
 

  was:
When using `arrow.BinaryTypes.String` in a schema, appending multiple strings, 
and then writing a record out to Parquet, the program's memory usage grows 
continuously. This also applies to the other `arrow.BinaryTypes`.

  

I took a heap dump midway through the program, and the majority of allocations 
come from `StringBuilder.Append` and are never GC'd. Memory usage approached 
16GB of RAM before I terminated the program.

  

I was not able to replicate this behavior with just `arrow.PrimitiveTypes`. 
Another interesting point: if the records are created but never written with 
pqarrow, memory does not grow. In the program below, commenting out 
`w.Write(rec)` avoids the memory growth.

Example program which causes memory to leak: 
{code:java}
package main 

import ( 
   "os" 
   "testing" 

   "github.com/apache/arrow/go/v9/arrow" 
   "github.com/apache/arrow/go/v9/arrow/array" 
   "github.com/apache/arrow/go/v9/arrow/memory" 
   "github.com/apache/arrow/go/v9/parquet" 
   "github.com/apache/arrow/go/v9/parquet/compress" 
   "github.com/apache/arrow/go/v9/parquet/pqarrow" 
) 

func main() { 
   f, _ := os.Create("/tmp/test.parquet") 

   arrowProps := pqarrow.DefaultWriterProps() 
   schema := arrow.NewSchema( 
      []arrow.Field{ 
         {Name: "aString", Type: arrow.BinaryTypes.String}, 
      }, 
      nil, 
   ) 
   w, _ := pqarrow.NewFileWriter(schema, f, 
parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
arrowProps) 

   builder := array.NewRecordBuilder(memory.DefaultAllocator, schema) 
   for i := 1; i < 50000000; i++ { 
      builder.Field(0).(*array.StringBuilder).Append("HelloWorld!") 
      if i%2000000 == 0 { 
         // Write row groups out every 2M times 
         rec := builder.NewRecord() 
         w.Write(rec) 
         rec.Release() 
      } 
   } 
   w.Close() 
}{code}
 


> [Go] String Binary Builder Leaks Memory When Writing to Parquet
> ---------------------------------------------------------------
>
>                 Key: ARROW-17573
>                 URL: https://issues.apache.org/jira/browse/ARROW-17573
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Go
>    Affects Versions: 9.0.0
>            Reporter: Sasha Sirovica
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
