westonpace commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1483045067
I think the approach might be different for writing and for reading. For
example, for writing, if you want your output batches to be a certain size
(in bytes) then you need to do one of the following (sketched after the list):
* Use the decoded & uncompressed size (no guessing required)
* Write to a buffer and then to the file (no guessing required)
* Guess how much the encodings & compression will reduce the data
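
Here is a minimal sketch of the first and third options, assuming pyarrow; the
function name, the `target_bytes` parameter, and the `guessed_ratio` knob are
mine for illustration, not an existing API:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_batches_by_size(table: pa.Table, path: str, target_bytes: int,
                          guessed_ratio: float = 1.0) -> None:
    # Estimate rows per row group from the decoded & uncompressed size
    # (option 1).  A guessed_ratio > 1.0 (uncompressed : compressed) instead
    # guesses how much encodings & compression will shrink the data (option 3).
    bytes_per_row = max(1, table.nbytes // max(1, table.num_rows))
    rows_per_group = max(1, int(target_bytes * guessed_ratio) // bytes_per_row)

    with pq.ParquetWriter(path, table.schema) as writer:
        for start in range(0, table.num_rows, rows_per_group):
            # Each slice becomes roughly one row group of the target size.
            writer.write_table(table.slice(start, rows_per_group))
```

For example, `write_batches_by_size(table, "out.parquet", 64 * 1024 * 1024, guessed_ratio=4.0)`
would aim for ~64MiB written per row group if the 4x compression guess holds.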
However, when reading, your options are more limited. Typically you want to
read a batch that is X bytes. You can't use the decoded & uncompressed size
(unless it is written in the statistics / metadata somewhere). You can't read
twice in the same way you can write twice. You are then left with guessing.
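
To make the guessing concrete: if the metadata does carry an uncompressed
figure (Parquet row groups expose one as `total_byte_size` in pyarrow), a rough
sketch could turn a byte target into a row-based batch size. The function name
and the estimation strategy here are mine:

```python
import pyarrow.parquet as pq

def read_batches_of_roughly(path: str, target_bytes: int):
    pf = pq.ParquetFile(path)
    meta = pf.metadata
    # Estimate decoded bytes per row from the row-group metadata, then turn
    # the byte target into a row count the reader understands.
    total_bytes = sum(meta.row_group(i).total_byte_size
                      for i in range(meta.num_row_groups))
    bytes_per_row = max(1, total_bytes // max(1, meta.num_rows))
    rows_per_batch = max(1, target_bytes // bytes_per_row)
    yield from pf.iter_batches(batch_size=rows_per_batch)
```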
However, there is one other approach you can take when reading. Instead of
asking your column decoder for X pages' or X row groups' worth of data, you can
ask it for X bytes' worth of data. The decoder can then advance through as many
pages as it needs to deliver X bytes of data. This is a bit tricky because, if
you are reading a batch, you might get a different number of rows from each
decoder. However, that can be addressed as well (see the sketch below).
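
Here is a very rough sketch of that idea; `Page` and `ByteBudgetDecoder` are
hypothetical stand-ins, not the actual Arrow or Parquet decoder classes:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Page:
    values: List[int]   # already-decoded values for one page
    decoded_size: int   # size of `values` in bytes once decoded

@dataclass
class ByteBudgetDecoder:
    """Advances through as many pages as needed to deliver ~budget bytes."""
    pages: List[Page]
    _pos: int = 0       # index of the next page
    _offset: int = 0    # values already consumed in the current page

    def next_chunk(self, budget: int) -> List[int]:
        out: List[int] = []
        used = 0
        while used < budget and self._pos < len(self.pages):
            page = self.pages[self._pos]
            per_value = max(1, page.decoded_size // max(1, len(page.values)))
            # Take as many values as the remaining budget allows, possibly
            # crossing into the next page on the following iteration.
            take = min(len(page.values) - self._offset,
                       max(1, (budget - used) // per_value))
            out.extend(page.values[self._offset:self._offset + take])
            used += take * per_value
            self._offset += take
            if self._offset >= len(page.values):
                self._pos += 1
                self._offset = 0
        return out
```

Since each column's decoder may hand back a different number of rows for the
same byte budget, one way to reconcile them is to trim every column to the
shortest one and carry the leftover rows into the next batch, so the assembled
record batch stays rectangular.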