mapleFU commented on issue #33652:
URL: https://github.com/apache/arrow/issues/33652#issuecomment-1401923263

   @wgtmac @wjones127 
   
   To explain this problem in detail, assume a schema:
   `(a: required int, b: required int)`.
   
   Assume the data below:
   
   ```
   a:  | Dict Page | Page 1 |  Page 2 | Unbuffered Written Values |
   b:  | Page 1 |  Page 2 | Page 3 | Unbuffered Written Values |
   ```
   
   So we have one `RowGroupWriter` and two `ColumnWriter`s here. Each column
   writer holds a `PageWriter`.
   
   There are two kinds of page writer:
   1. `SerializedPageWriter`, which writes each compressed page directly to the
      sink.
   2. `BufferedPageWriter`, which buffers the pages being written and later
      writes multiple compressed pages to the sink.
   
   So, if we want to get the "currently buffered value size" and the
   "unbuffered estimated size", `sink_.Tell()` is not enough, because a buffered
   page writer holds pages that have not yet been written to `sink`.
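   
   A minimal sketch of why the sink position under-reports a buffered writer.
   All `Toy*` types below are hypothetical stand-ins written for illustration;
   they are not the Arrow or Parquet C++ classes and do not mirror their
   signatures.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <string>
   #include <vector>
   
   // Hypothetical output stream that only tracks how many bytes it received.
   struct ToySink {
     int64_t position = 0;
     void Write(const std::string& bytes) {
       position += static_cast<int64_t>(bytes.size());
     }
     int64_t Tell() const { return position; }
   };
   
   // Hypothetical writer that sends each compressed page to the sink at once.
   struct ToySerializedPageWriter {
     explicit ToySerializedPageWriter(ToySink* s) : sink(s) {}
     void WritePage(const std::string& compressed_page) {
       sink->Write(compressed_page);
     }
     int64_t buffered_bytes() const { return 0; }  // nothing is held back
     ToySink* sink;
   };
   
   // Hypothetical writer that keeps compressed pages in memory until Close().
   struct ToyBufferedPageWriter {
     explicit ToyBufferedPageWriter(ToySink* s) : sink(s) {}
     void WritePage(const std::string& compressed_page) {
       pages.push_back(compressed_page);
     }
     // Bytes that exist as finished pages but are invisible to sink->Tell().
     int64_t buffered_bytes() const {
       int64_t total = 0;
       for (const auto& page : pages) {
         total += static_cast<int64_t>(page.size());
       }
       return total;
     }
     void Close() {
       assert(!closed);  // flushing twice would double-count the pages
       for (const auto& page : pages) sink->Write(page);
       pages.clear();
       closed = true;
     }
     ToySink* sink;
     std::vector<std::string> pages;
     bool closed = false;
   };
   ```
   
   In this toy model, the sink position only reflects the serialized writer
   until the buffered writer is closed, so a size estimate would have to add
   each writer's `buffered_bytes()` to the sink position.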
   
   We do already have an interface along these lines: `total_bytes_written`.
   But, as you can see, it only reports the "uncompressed size". For example, if
   `column b` uses `DELTA_BINARY_PACKED`, its written size might be 50 B while
   `total_bytes_written` may be 5 KB, so it cannot tell us how many bytes were
   actually written.
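   
   To give a feel for how large that gap can be, here is a rough sketch that
   compares a plain fixed-width size with a crude single-block delta estimate.
   The function names and the 13-byte header are assumptions made for
   illustration; the real `DELTA_BINARY_PACKED` layout (blocks, miniblocks,
   varint headers) is more involved and will give different numbers.
   
   ```cpp
   #include <algorithm>
   #include <cstddef>
   #include <cstdint>
   #include <limits>
   #include <vector>
   
   // Bytes needed to store the values as fixed-width 32-bit integers.
   int64_t PlainSizeBytes(const std::vector<int32_t>& values) {
     return static_cast<int64_t>(values.size()) * 4;
   }
   
   // Number of bits needed to represent v (0 needs 0 bits).
   int BitWidth(uint64_t v) {
     int bits = 0;
     while (v != 0) {
       ++bits;
       v >>= 1;
     }
     return bits;
   }
   
   // Crude estimate: store the first value, the minimum delta and a bit width,
   // then bit-pack (delta - min_delta) for every remaining value.
   int64_t ToyDeltaSizeBytes(const std::vector<int32_t>& values) {
     if (values.empty()) return 0;
     // Assumed header: 4 bytes first value + 8 bytes min delta + 1 byte width.
     const int64_t kHeaderBytes = 4 + 8 + 1;
     if (values.size() == 1) return kHeaderBytes;
     int64_t min_delta = std::numeric_limits<int64_t>::max();
     int64_t max_delta = std::numeric_limits<int64_t>::min();
     for (std::size_t i = 1; i < values.size(); ++i) {
       const int64_t delta = static_cast<int64_t>(values[i]) -
                             static_cast<int64_t>(values[i - 1]);
       min_delta = std::min(min_delta, delta);
       max_delta = std::max(max_delta, delta);
     }
     const int width = BitWidth(static_cast<uint64_t>(max_delta - min_delta));
     const int64_t packed_bits =
         static_cast<int64_t>(values.size() - 1) * width;
     return kHeaderBytes + (packed_bits + 7) / 8;  // round up to whole bytes
   }
   ```
   
   For 1250 consecutive integers `0, 1, ..., 1249`, `PlainSizeBytes` returns
   5000 while `ToyDeltaSizeBytes` returns 13, since every delta equals the
   minimum delta and needs zero bits.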
   

