westonpace commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1483045067
I think the approach might be different for writing and for reading. For
example, for writing, if you want your output batches to be a certain size
(in bytes) then you need to do one of the following (sketched after the list):
* Use the decoded & uncompressed size (no guessing required)
* Write to a buffer and then to the file (no guessing required)
* Guess how much the encodings & compression will reduce the data
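
Here is a minimal sketch of the first and third options, assuming pyarrow; the
function name, the `target_bytes` parameter, and the `guessed_ratio` knob are
mine for illustration, not an existing API:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_batches_by_size(table: pa.Table, path: str, target_bytes: int,
                          guessed_ratio: float = 1.0) -> None:
    # Estimate rows per row group from the decoded & uncompressed size
    # (option 1).  A guessed_ratio > 1.0 (uncompressed : compressed) instead
    # guesses how much encodings & compression will shrink the data (option 3).
    bytes_per_row = max(1, table.nbytes // max(1, table.num_rows))
    rows_per_group = max(1, int(target_bytes * guessed_ratio) // bytes_per_row)

    with pq.ParquetWriter(path, table.schema) as writer:
        for start in range(0, table.num_rows, rows_per_group):
            # Each slice becomes roughly one row group of the target size.
            writer.write_table(table.slice(start, rows_per_group))
```

For example, `write_batches_by_size(table, "out.parquet", 64 * 1024 * 1024, guessed_ratio=4.0)`
would aim for ~64MiB written per row group if the 4x compression guess holds.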
However, when reading, your options are more limited. Typically you want to
read a batch that is X bytes. You can't use the decoded & uncompressed size
(unless it is written in the statistics / metadata somewhere). You can't read
twice in the same way you can write twice. You are then left with guessing.
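
To make the guessing concrete: if the metadata does carry an uncompressed
figure (Parquet row groups expose one as `total_byte_size` in pyarrow), a rough
sketch could turn a byte target into a row-based batch size. The function name
and the estimation strategy here are mine:

```python
import pyarrow.parquet as pq

def read_batches_of_roughly(path: str, target_bytes: int):
    pf = pq.ParquetFile(path)
    meta = pf.metadata
    # Estimate decoded bytes per row from the row-group metadata, then turn
    # the byte target into a row count the reader understands.
    total_bytes = sum(meta.row_group(i).total_byte_size
                      for i in range(meta.num_row_groups))
    bytes_per_row = max(1, total_bytes // max(1, meta.num_rows))
    rows_per_batch = max(1, target_bytes // bytes_per_row)
    yield from pf.iter_batches(batch_size=rows_per_batch)
```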
However, there is one other approach you can take when reading. Instead of
asking your column decoder for X pages' or X row groups' worth of data, you can
ask it for X bytes' worth of data. The decoder can then advance through as many
pages as it needs to deliver X bytes of data. This is a bit tricky because, if
you are reading a batch, you might get a different number of rows from each
decoder. However, that can be addressed as well (see the sketch below).
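
Here is a very rough sketch of that idea; `Page` and `ByteBudgetDecoder` are
hypothetical stand-ins, not the actual Arrow or Parquet decoder classes:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Page:
    values: List[int]   # already-decoded values for one page
    decoded_size: int   # size of `values` in bytes once decoded

@dataclass
class ByteBudgetDecoder:
    """Advances through as many pages as needed to deliver ~budget bytes."""
    pages: List[Page]
    _pos: int = 0       # index of the next page
    _offset: int = 0    # values already consumed in the current page

    def next_chunk(self, budget: int) -> List[int]:
        out: List[int] = []
        used = 0
        while used < budget and self._pos < len(self.pages):
            page = self.pages[self._pos]
            per_value = max(1, page.decoded_size // max(1, len(page.values)))
            # Take as many values as the remaining budget allows, possibly
            # crossing into the next page on the following iteration.
            take = min(len(page.values) - self._offset,
                       max(1, (budget - used) // per_value))
            out.extend(page.values[self._offset:self._offset + take])
            used += take * per_value
            self._offset += take
            if self._offset >= len(page.values):
                self._pos += 1
                self._offset = 0
        return out
```

Since each column's decoder may hand back a different number of rows for the
same byte budget, one way to reconcile them is to trim every column to the
shortest one and carry the leftover rows into the next batch, so the assembled
record batch stays rectangular.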