ssirovica commented on issue #14007: URL: https://github.com/apache/arrow/issues/14007#issuecomment-1232319128
@zeroshade I think I have a breakthrough on the source of the leak after digging through the debugger! In the min/max statistics kept during the write of a row group (https://github.com/apache/arrow/blob/master/go/parquet/metadata/statistics_types.gen.go#L1900), tracking the max and min values by holding on to the incoming byte slices is what retains the memory. The capacity of the min/max slices is very large (I assume they point into the write buffer), while the length is small (11, for our "HelloWorld!").

As a quick hack, using the same sample program, I changed the generated code to copy the bytes instead:

```go
// SetMinMax updates the min and max values only if they are not currently set
// or if argMin is less than the current min / argMax is greater than the current max
func (s *ByteArrayStatistics) SetMinMax(argMin, argMax parquet.ByteArray) {
	maybeMinMax := s.cleanStat([2]parquet.ByteArray{argMin, argMax})
	if maybeMinMax == nil {
		return
	}

	// Copy the candidate min/max so the statistics no longer reference
	// the (much larger) buffer backing the incoming byte slices.
	min := make([]byte, len((*maybeMinMax)[0]))
	max := make([]byte, len((*maybeMinMax)[1]))
	copy(min, (*maybeMinMax)[0])
	copy(max, (*maybeMinMax)[1])

	if !s.hasMinMax {
		s.hasMinMax = true
		s.min = min
		s.max = max
	} else {
		if !s.less(s.min, min) {
			s.min = min
		}
		if s.less(s.max, max) {
			s.max = max
		}
	}
}
```

With this change the sample program runs without memory increasing. I also diffed the generated parquet on a small number of records and it was identical.

I think what's still missing is understanding why the statistics aren't being released on write of a row group, and resolving that rather than copying. I plan to investigate this thread a bit more; the copy above was just to check whether we're on the right track. Let me know if this sounds plausible, would appreciate your thoughts :)
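
For context, here is a minimal, self-contained sketch of the Go slice behaviour I think is at play. It doesn't touch the parquet library at all; the 64 MiB buffer and the 11-byte values are just stand-ins for the column write buffer and the "HelloWorld!" statistics. A short re-slice keeps the whole backing array reachable, while an explicit copy lets it be collected:

```go
package main

import (
	"fmt"
	"runtime"
)

// held stands in for statistics objects that keep references to min/max bytes.
var held [][]byte

func retainBySlicing() {
	// A large buffer, standing in for the column chunk's write buffer.
	buf := make([]byte, 64<<20) // 64 MiB
	// Re-slicing keeps the entire 64 MiB backing array reachable,
	// even though we only hold on to 11 bytes of it.
	held = append(held, buf[:11])
}

func retainByCopying() {
	buf := make([]byte, 64<<20)
	// Copying detaches the 11 bytes from the large backing array,
	// so buf can be garbage collected once this function returns.
	cp := make([]byte, 11)
	copy(cp, buf[:11])
	held = append(held, cp)
}

func heapInUse() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	for i := 0; i < 8; i++ {
		retainBySlicing()
	}
	fmt.Printf("after slicing: %d MiB in use\n", heapInUse()>>20)

	held = nil
	for i := 0; i < 8; i++ {
		retainByCopying()
	}
	fmt.Printf("after copying: %d MiB in use\n", heapInUse()>>20)
}
```

Running this shows hundreds of MiB retained in the slicing case and almost nothing in the copying case, which matches what I'm seeing in the debugger for the min/max fields.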
