ssirovica commented on issue #14007:
URL: https://github.com/apache/arrow/issues/14007#issuecomment-1232319128

   @zeroshade I think I have a breakthrough on the source of the leak after 
digging through the debugger! 
   
   While a row group is being written, the MinMax statistics code at 
https://github.com/apache/arrow/blob/master/go/parquet/metadata/statistics_types.gen.go#L1900 
keeps track of the min and max values by holding a reference to the original 
byte slice, and that reference is what retains the memory.
   
   The capacity of the min/max slices is very large (I assume the backing 
array is the buffer) while the length is small (11, for our "HelloWorld!"). As 
a quick hack, using the same sample program, I changed the generated code to 
copy the bytes, changing the function to:
   ```
   // SetMinMax updates the min and max values only if they are not currently set
   // or if argMin is less than the current min / argMax is greater than the
   // current max
   func (s *ByteArrayStatistics) SetMinMax(argMin, argMax parquet.ByteArray) {
        maybeMinMax := s.cleanStat([2]parquet.ByteArray{argMin, argMax})
        if maybeMinMax == nil {
                return
        }
   
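        // Copy the bytes so the statistics do not keep the (much larger)
        // backing array of the original slices alive.
        min := make([]byte, len((*maybeMinMax)[0]))
        max := make([]byte, len((*maybeMinMax)[1]))
        copy(min, (*maybeMinMax)[0])
        copy(max, (*maybeMinMax)[1])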
   
        if !s.hasMinMax {
                s.hasMinMax = true
                s.min = min
                s.max = max
        } else {
                if !s.less(s.min, min) {
                        s.min = min
                }
                if s.less(s.max, max) {
                        s.max = max
                }
        }
   }
   ```
   The sample program then runs without memory growth. I also diff'd the 
generated Parquet files on a small number of records, and they were the same.
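
   For anyone who wants to see the slice behaviour in isolation, here is a 
minimal standalone sketch. It does not use the Arrow code at all: `detach` is 
a hypothetical helper and the 1 MiB buffer size is an arbitrary stand-in for 
the real write buffer. It only illustrates why a short sub-slice can keep a 
large backing array reachable and how a copy avoids that.

```go
package main

import "fmt"

// detach is a hypothetical helper (not part of Arrow). It returns a copy of b
// backed by its own array of exactly len(b) bytes, so the result no longer
// references b's possibly much larger backing array.
func detach(b []byte) []byte {
	out := make([]byte, len(b))
	copy(out, b)
	return out
}

func main() {
	// Stand-in for a large write buffer; the size is an assumption.
	buf := make([]byte, 1<<20)
	copy(buf, "HelloWorld!")

	// A sub-slice shares the backing array: small length, large capacity.
	// As long as view is reachable, the whole 1 MiB array is too.
	view := buf[:11]
	fmt.Println(len(view), cap(view)) // 11 1048576

	// The copy owns an 11-byte array of its own.
	owned := detach(view)
	fmt.Println(len(owned), cap(owned)) // 11 11

	// Writing through buf changes view but not owned, which confirms that
	// owned no longer aliases the buffer.
	buf[0] = 'J'
	fmt.Println(string(view), string(owned)) // JelloWorld! HelloWorld!
}
```

   This matches what the debugger showed for the min/max values: the length 
is 11 but the capacity is that of the buffer they were sliced from.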
   
   
   I think what's still missing is an understanding of why the statistics 
aren't released once a row group (RG) has been written; fixing that would be 
preferable to copying. I plan to investigate this thread a bit more. The copy 
example above was just to see if we're on the right track.
   
   Let me know if this sounds plausible, would appreciate your thoughts :) 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

