mapleFU commented on issue #33652: URL: https://github.com/apache/arrow/issues/33652#issuecomment-1401923263
@wgtmac @wjones127 To explain this problem in detail, let me assume a schema: `(a: required int, b: required int)`, with the data laid out as below:

```
a: | Dict Page | Page 1 | Page 2 | Unbuffered Written Values |
b: | Page 1 | Page 2 | Page 3 | Unbuffered Written Values |
```

So we have one `RowGroupWriter` and two `ColumnWriter`s here. Each column writer holds a `PageWriter`. There are two kinds of page writer:

1. `SerializedPageWriter`, which just writes each compressed page to the sink.
2. `BufferedPageWriter`, which buffers the pages being written and then writes multiple compressed pages to the sink.

So if we want to get the "currently buffered value size" and the "unbuffered estimated size", `sink_.Tell()` is not enough, because a buffered page writer holds pages that have not been written to the sink yet.

We do already have an interface along these lines: `total_bytes_written`. But, as you can see, it is just the "uncompressed size". For example, assume column `b` uses `DELTA_BINARY_PACKED`: its size might be 50B, but `total_bytes_written` may be 5KB, which cannot tell us the size actually written.
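To make the `sink_.Tell()` point concrete, here is a minimal toy sketch. It is not the Arrow/Parquet implementation: `ToySink`, `ToyPageWriter`, `buffered_bytes()` and `EstimateRowGroupBytes` are hypothetical names made up for illustration, and page sizes are passed in as plain numbers instead of being produced by real encoding and compression. Mixing a direct and a buffered writer on one sink is also only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy sink (hypothetical): only tracks how many bytes have reached it.
class ToySink {
 public:
  void Write(int64_t nbytes) { position_ += nbytes; }
  int64_t Tell() const { return position_; }

 private:
  int64_t position_ = 0;
};

// Toy page writer (hypothetical). With buffered == false it mimics the
// "serialized" behaviour: each compressed page goes straight to the sink.
// With buffered == true it mimics the "buffered" behaviour: pages are held
// in memory until Close().
class ToyPageWriter {
 public:
  ToyPageWriter(ToySink* sink, bool buffered)
      : sink_(sink), buffered_(buffered) {}

  void WritePage(int64_t compressed_size) {
    assert(compressed_size >= 0);
    if (buffered_) {
      pending_bytes_ += compressed_size;  // invisible to sink->Tell()
    } else {
      sink_->Write(compressed_size);
    }
  }

  // Compressed bytes held in memory that the sink has not seen yet.
  int64_t buffered_bytes() const { return pending_bytes_; }

  // Flush everything that was held back.
  void Close() {
    sink_->Write(pending_bytes_);
    pending_bytes_ = 0;
  }

 private:
  ToySink* sink_;
  bool buffered_;
  int64_t pending_bytes_ = 0;
};

// Assumed estimate: bytes already at the sink, plus pages still buffered in
// the page writers, plus a caller-supplied guess for values that have not
// been turned into a page yet.
int64_t EstimateRowGroupBytes(const ToySink& sink,
                              const std::vector<const ToyPageWriter*>& writers,
                              int64_t unpaged_estimate) {
  int64_t total = sink.Tell() + unpaged_estimate;
  for (const ToyPageWriter* writer : writers) {
    total += writer->buffered_bytes();
  }
  return total;
}
```

In this sketch, if an unbuffered writer emits pages of 100 and 200 bytes and a buffered writer emits two 50-byte pages, `sink.Tell()` reports 300 while 100 bytes are still pending, so any size estimate has to ask the page writers as well as the sink.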
