Re: [PR] Utilize memory allocator in ReadProperties.GetStream [arrow-go]

via GitHub Wed, 29 Oct 2025 07:44:35 -0700


daniel-adam-tfs commented on code in PR #547:
URL: https://github.com/apache/arrow-go/pull/547#discussion_r2473541179



##########
parquet/file/page_reader.go:
##########
@@ -500,14 +508,31 @@ func (p *serializedPageReader) Page() Page {
        return p.curPage
 }
 
+func (p *serializedPageReader) readUncompressed(rd io.Reader, lenUncompressed 
int, buf []byte) ([]byte, error) {
+       n, err := io.ReadFull(rd, buf[:lenUncompressed])
+       if err != nil {
+               return nil, err
+       }
+       if n != lenUncompressed {
+               return nil, fmt.Errorf("parquet: expected to read %d bytes but 
only read %d", lenUncompressed, n)
+       }
+       if p.cryptoCtx.DataDecryptor != nil {

Review Comment:
   Alright, I'll fix it for DataPageV2 then. I'll add a unit tests without 
compression and with encryption that should fail with the current main, if I 
have the time.
   
   I reran the profiler with the current commit in this PR, with a 2.8GB 
parquet file stored in S3, uncompressed. And cpu profiler is showing that more 
time is spent in runtime.memmove (copying memory) than in syscall.Syscall6 
(read). Which is annoying me. :-D
   
   I think it should be still possible to eliminate at least one copy for the 
uncompressed case.
   
   So this is my scenario:
   1) `ReaderProperties.GetStream` reads column chunk from a TLS and stores it 
in a buffer (or just allocates the buffer if `BufferedStreamEnabled`, but lets 
go with the unbuffered case for now)
   2) `serializedPageReader` is created with the buffer returned from 
ReaderProperties.GetStream
   3) `serializedPageReader.Next` get the page header calls 
`serializedPageReader.readUncompressed` / `serializedPageReader.decompress` 
which reads data from the GetStream buffer into dictPageBuffer/dataPageBuffer
     3a) for the uncompressed case the this is just a copy
   4) `page` struct is created from the bytes written to the 
dictPageBuffer/dataPageBuffer
   
   I think I could avoid the copy in 3a and create page directly from the bytes 
in the buffer returned by `ReaderProperties.GetStream` by using combination of 
calls Peek (to get the bytes)+Discard (to move the internal position inside the 
buffer). This should hold when `BufferedStreamEnabled` is false, I have to 
check what happens when it is true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Utilize memory allocator in ReadProperties.GetStream [arrow-go]

Reply via email to