zeroshade commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r867062398


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
                        // we don't know this page type, we're allowed to skip non-data pages
                        continue
                }
-
-               p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Ok, looking back at the code I notice something: in all cases *except* a V2 DataPage, even if the data isn't compressed, we end up copying it out of `buf` into memory allocated by `decompressBuffer`. This has two effects:
   
   1. It severs the connection between `p.buf` and the `data` variable we pass to the data page.
   2. It *ensures* we can't possibly leak memory, because we're not using the passed-in memory allocator to create that copy; we know for a fact that Go allocated that memory for the page. (Not necessarily a good thing, but we can go with it for now; a larger refactor to improve this may be a future idea.) The only time this isn't the case is line 586, where we do `io.ReadFull(p.r, p.buf.Bytes())`.
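   
   To make that ownership distinction concrete, here's a minimal standalone sketch (assuming the v8 import path; `CheckedAllocator` is only there to observe what the allocator sees). It contrasts an allocator-backed buffer with `NewBufferBytes`, which merely wraps a Go slice owned by the GC:
   
   ```go
   package main
   
   import (
           "fmt"
   
           "github.com/apache/arrow/go/v8/arrow/memory" // version in the path may differ
   )
   
   func main() {
           // CheckedAllocator tracks outstanding allocations, which is where a
           // "leak" would show up for allocator-backed buffers.
           mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
   
           // Allocator-backed: the allocator owns this memory until the
           // buffer's refcount drops to zero.
           buf := memory.NewResizableBuffer(mem)
           buf.Resize(64)
           fmt.Println(mem.CurrentAlloc()) // 64
   
           // GC-backed: NewBufferBytes just wraps an existing Go slice, so the
           // allocator never sees those bytes and can't leak them.
           wrapped := memory.NewBufferBytes(make([]byte, 64))
           fmt.Println(wrapped.Len())      // 64 bytes of data...
           fmt.Println(mem.CurrentAlloc()) // ...but still only 64 tracked
   
           buf.Release()                   // refcount hits zero; memory returns to the allocator
           fmt.Println(mem.CurrentAlloc()) // 0
   }
   ```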
   
   My original idea here was that we'd hand the page the same buffer where possible, creating a new buffer for `p.buf` each time; this is exemplified by the `Release` that exists on the `Page` interface (see the consumer-side sketch below). So the easy solution I can see is to keep your change of creating a new buffer for each call into `Next`, and make a slight modification in the `DataPageV2` case.
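   
   For context, that consumer-side contract looks roughly like this; a hypothetical sketch assuming the `PageReader` interface's `Next`/`Page`/`Err` methods, not the final code:
   
   ```go
   package example
   
   import "github.com/apache/arrow/go/v8/parquet/file" // version in the path may differ
   
   // consumePages is a hypothetical consumer loop: each call to Next
   // produces a Page holding its own buffer reference, and the consumer
   // drops that reference with Release when it's done.
   func consumePages(rdr file.PageReader) error {
           for rdr.Next() {
                   pg := rdr.Page()
                   // ... decode values from pg ...
   
                   // Drop the page's reference; once the reader also releases
                   // its reference, the underlying buffer can actually be freed.
                   pg.Release()
           }
           return rdr.Err()
   }
   ```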
   
   Currently, it looks like this in your change:
   
   ```go
   var data []byte
   if compressed {
           if levelsBytelen > 0 {
                   io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
           }
           if data, p.err = p.decompress(lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
                   return false
           }
   } else {
           io.ReadFull(p.r, buf.Bytes())
           data = buf.Bytes()
   }
   ```
   
   And then we call `NewBufferBytes(data)` when creating the page. We can likely solve the issue by instead doing this:
   
   ```go
   var pagebuf *memory.Buffer
   if compressed {
           if levelsBytelen > 0 {
                   io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
           }
           var data []byte
           if data, p.err = p.decompress(lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
                   return false
           }
           pagebuf = memory.NewBufferBytes(data)
   } else {
           io.ReadFull(p.r, buf.Bytes())
           pagebuf = buf
           pagebuf.Retain()
   }
   ```
   
   And then use `pagebuf` to construct the `DataPageV2`. That should be sufficient: it'll still make a new buffer for each call to `Next` and allow the defer to release the memory, but it lets us avoid the copy in this case by just passing the buffer to the page safely (see the sketch below).
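   
   To illustrate why the `Retain` + deferred `Release` pairing is safe, here's a minimal standalone analogue (assuming the v8 import path; `handOff` is a hypothetical stand-in for the uncompressed branch of `Next`, not actual code from this PR):
   
   ```go
   package main
   
   import (
           "fmt"
   
           "github.com/apache/arrow/go/v8/arrow/memory" // version in the path may differ
   )
   
   // handOff mimics the uncompressed branch above: allocate a fresh buffer,
   // retain it on behalf of the "page", and let the deferred Release drop
   // the reader's own reference on the way out.
   func handOff(mem memory.Allocator) *memory.Buffer {
           buf := memory.NewResizableBuffer(mem) // new buffer for this call, like p.buf
           defer buf.Release()                   // the reader's reference, dropped on return
   
           buf.Resize(64) // stand-in for io.ReadFull(p.r, buf.Bytes())
   
           pagebuf := buf
           pagebuf.Retain() // the page's reference keeps the memory alive past the defer
           return pagebuf
   }
   
   func main() {
           mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
   
           pagebuf := handOff(mem)
           fmt.Println(mem.CurrentAlloc()) // 64: the page's reference is still live
   
           pagebuf.Release()               // what Page.Release would do when the consumer is done
           fmt.Println(mem.CurrentAlloc()) // 0: nothing leaked
   }
   ```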
   
   Does all of that make sense? I kind of rambled a bit, but I hope it all came across well.


