daniel-adam-tfs commented on code in PR #547:
URL: https://github.com/apache/arrow-go/pull/547#discussion_r2473541179
##########
parquet/file/page_reader.go:
##########
@@ -500,14 +508,31 @@ func (p *serializedPageReader) Page() Page {
return p.curPage
}
+func (p *serializedPageReader) readUncompressed(rd io.Reader, lenUncompressed
int, buf []byte) ([]byte, error) {
+ n, err := io.ReadFull(rd, buf[:lenUncompressed])
+ if err != nil {
+ return nil, err
+ }
+ if n != lenUncompressed {
+ return nil, fmt.Errorf("parquet: expected to read %d bytes but
only read %d", lenUncompressed, n)
+ }
+ if p.cryptoCtx.DataDecryptor != nil {
Review Comment:
Alright, I'll fix it for DataPageV2 then. I'll add a unit tests without
compression and with encryption that should fail with the current main, if I
have the time.
I reran the profiler with the current commit in this PR, with a 2.8GB
parquet file stored in S3, uncompressed. And cpu profiler is showing that more
time is spent in runtime.memmove (copying memory) than in syscall.Syscall6
(read). Which is annoying me. :-D
I think it should be still possible to eliminate at least one copy for the
uncompressed case.
So this is my scenario:
1) `ReaderProperties.GetStream` reads column chunk from a TLS and stores it
in a buffer (or just allocates the buffer if `BufferedStreamEnabled`, but lets
go with the unbuffered case for now)
2) `serializedPageReader` is created with the buffer returned from
ReaderProperties.GetStream
3) `serializedPageReader.Next` get the page header calls
`serializedPageReader.readUncompressed` / `serializedPageReader.decompress`
which reads data from the GetStream buffer into dictPageBuffer/dataPageBuffer
3a) for the uncompressed case the this is just a copy
4) `page` struct is created from the bytes written to the
dictPageBuffer/dataPageBuffer
I think I could avoid the copy in 3a and create page directly from the bytes
in the buffer returned by `ReaderProperties.GetStream` by using combination of
calls Peek (to get the bytes)+Discard (to move the internal position inside the
buffer). This should hold when `BufferedStreamEnabled` is false, I have to
check what happens when it is true.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]