zeroshade commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r867062398
##########
go/parquet/file/page_reader.go:
##########
```diff
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)
```
Review Comment:
Ok, looking back at the code I notice something: in all cases *except* for a
V2 DataPage, even if the data isn't compressed, we end up copying it out of
`buf` into memory allocated by `decompressBuffer`. This has two effects:
1. It severs the connection between `p.buf` and the `data` variable we
pass to the data page.
2. It *ensures* that we can't possibly leak memory, because we're not using
the passed-in memory allocator to create that copy; we know for a fact that Go
allocated that memory for the page. (Not necessarily a good thing, but sure, we
can go with that for now; a larger refactor to improve this may be a future
idea.) The only time this isn't the case is line 586, where we do
`io.ReadFull(p.r, p.buf.Bytes())`. See the sketch just below for why the copy
makes releasing `p.buf` safe.
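To make effect 1 concrete, here's a minimal standalone sketch. It's
illustrative only: `decompressLike` is a hypothetical stand-in for
`decompressBuffer`, and the module path/version in the import is an assumption
that depends on which Arrow release you build against.
```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v8/arrow/memory"
)

// decompressLike is a hypothetical stand-in for decompressBuffer: it
// returns a fresh Go-allocated slice rather than writing into the source.
func decompressLike(src []byte) []byte {
	out := make([]byte, len(src)) // Go-managed memory, not from the allocator
	copy(out, src)
	return out
}

func main() {
	buf := memory.NewResizableBuffer(memory.DefaultAllocator)
	buf.Resize(4)
	copy(buf.Bytes(), []byte{1, 2, 3, 4})

	data := decompressLike(buf.Bytes())
	buf.Release() // safe: data no longer aliases buf's memory

	fmt.Println(data) // prints [1 2 3 4]
}
```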
My original idea here was that we'd reuse the same buffer where possible,
rather than creating a new buffer for `p.buf` each time; that's why `Release`
exists on the `Page` interface. So the easy solution I can see is to keep your
change of creating a new buffer for each call into `Next`, and make a slight
modification in the `DataPageV2` case.
Currently it looks like this in your change:
```go
var data []byte
if compressed {
	if levelsBytelen > 0 {
		io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
	}
	if data, p.err = p.decompress(lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
		return false
	}
} else {
	io.ReadFull(p.r, buf.Bytes())
	data = buf.Bytes()
}
```
And then we call `NewBufferBytes(data)` when creating the page. We can
likely solve the issue by instead doing this:
```go
var pagebuf *memory.Buffer
if compressed {
	if levelsBytelen > 0 {
		io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
	}
	var data []byte
	if data, p.err = p.decompress(lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
		return false
	}
	// decompress returned Go-allocated memory, so wrapping it is safe
	pagebuf = memory.NewBufferBytes(data)
} else {
	io.ReadFull(p.r, buf.Bytes())
	// give the page its own reference so the deferred Release in
	// Next doesn't free the data out from under it
	pagebuf = buf
	pagebuf.Retain()
}
```
And then use `pagebuf` to construct the `DataPageV2`. That should be
sufficient: it'll still make a new buffer for each call to `Next` and allow the
defer to release the memory, but it avoids the copy in this case by passing the
buffer to the page safely. Roughly, the refcounting plays out like the
standalone demo below.
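Here's a minimal, self-contained sketch of that buffer lifecycle. Again,
illustrative only: the import path/version is an assumption, and the "page"
here is just a second reference to the buffer, standing in for what
`DataPageV2` would hold.
```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v8/arrow/memory"
)

func main() {
	// CheckedAllocator tracks outstanding allocations so we can
	// verify at the end that nothing leaked.
	mem := memory.NewCheckedAllocator(memory.DefaultAllocator)

	buf := memory.NewResizableBuffer(mem) // what Next allocates
	buf.Resize(16)

	// the "page" takes its own reference, mirroring pagebuf.Retain()
	pagebuf := buf
	pagebuf.Retain()

	// Next's deferred Release drops only our reference; the data
	// survives because the page still holds one
	buf.Release()
	fmt.Println(len(pagebuf.Bytes())) // 16, still valid

	// Page.Release drops the last reference, freeing the memory
	pagebuf.Release()
	fmt.Println(mem.CurrentAlloc()) // 0, nothing leaked
}
```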
Does all of that make sense? I kind of rambled a bit but I hope it all made
sense and came across well.