matthewmcnew opened a new issue, #39870:
URL: https://github.com/apache/arrow/issues/39870

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   There does not appear to be an accurate way to identify or estimate the size 
of the current row group with `pqarrow.FileWriter`. 
   
   `RowGroupTotalCompressedBytes()` provides the total bytes from [created data pages](https://github.com/apache/arrow/blob/main/go/parquet/file/column_writer.go#L334),
   but when the [dictionary page size limit is reached](https://github.com/apache/arrow/blob/main/go/parquet/file/column_writer_types.gen.go.tmpl#L240-L242),
   the buffered data pages are flushed and the [total size is reset to "0"](https://github.com/apache/arrow/blob/main/go/parquet/file/column_writer.go#L400).
   This means `RowGroupTotalCompressedBytes()` only reports the size of pages
   created after the dictionary page size limit was reached. Ideally, the total
   compressed bytes should include all created data pages.
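
To make the reset easier to see, here is a minimal toy model of the accounting described above. It does not use the Arrow library: `toyColumnWriter`, its fields, and the byte sizes are hypothetical names and numbers chosen for illustration, and the model tracks only data page sizes.

```go
package main

import "fmt"

// toyColumnWriter is a hypothetical stand-in for a dictionary-encoding column
// writer. It models only the counter behaviour described in this issue and is
// not the Arrow implementation.
type toyColumnWriter struct {
	dictLimit       int64 // assumed dictionary page size limit, in bytes
	dictSize        int64 // bytes currently held by the dictionary
	fellBack        bool  // true once the dictionary limit has been reached
	totalCompressed int64 // what a RowGroupTotalCompressedBytes-style getter reports
	actualBytes     int64 // ground truth: bytes of every data page created
}

// addPage records one data page of pageSize bytes whose values grew the
// dictionary by dictGrowth bytes.
func (w *toyColumnWriter) addPage(pageSize, dictGrowth int64) {
	w.totalCompressed += pageSize
	w.actualBytes += pageSize
	if w.fellBack {
		return
	}
	w.dictSize += dictGrowth
	if w.dictSize >= w.dictLimit {
		// Buffered pages are flushed and the counter restarts at zero,
		// so the pages created so far drop out of the reported total.
		w.totalCompressed = 0
		w.fellBack = true
	}
}

func main() {
	w := &toyColumnWriter{dictLimit: 100}
	for i := 0; i < 5; i++ {
		w.addPage(1000, 30)
		fmt.Printf("page %d: reported=%d actual=%d\n", i+1, w.totalCompressed, w.actualBytes)
	}
}
```

In this sketch the reported value drops to 0 on the fourth page, when the assumed limit is crossed, and from then on it lags the actual total by the size of the pages created before the reset.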
   
   `RowGroupTotalBytesWritten()` provides the total bytes of data pages when
   [they are written](https://github.com/apache/arrow/blob/main/go/parquet/file/column_writer.go#L461),
   but not while a page is buffered because the [dictionary page is still being created](https://github.com/apache/arrow/blob/main/go/parquet/file/column_writer.go#L330).
   As a result, `RowGroupTotalBytesWritten()` inaccurately reports "0" bytes
   until the dictionary page size limit is reached.
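
The buffering side of the problem can be sketched the same way. This is again a toy model rather than Arrow code: `toyPageSink` and its fields are hypothetical, the sizes are made up, and the dictionary page's own bytes are ignored for simplicity.

```go
package main

import "fmt"

// toyPageSink is a hypothetical model of a writer that buffers data pages
// while the dictionary page is still being built. It is not the Arrow type.
type toyPageSink struct {
	dictLimit    int64   // assumed dictionary page size limit, in bytes
	dictSize     int64   // bytes currently held by the dictionary
	dictClosed   bool    // true once the dictionary limit has been reached
	buffered     []int64 // sizes of data pages held back from the sink
	bytesWritten int64   // what a RowGroupTotalBytesWritten-style getter reports
}

// addPage either buffers the page (dictionary still open) or writes it.
func (s *toyPageSink) addPage(pageSize, dictGrowth int64) {
	if s.dictClosed {
		s.bytesWritten += pageSize
		return
	}
	s.buffered = append(s.buffered, pageSize)
	s.dictSize += dictGrowth
	if s.dictSize >= s.dictLimit {
		// Limit reached: flush everything that was buffered so far.
		for _, size := range s.buffered {
			s.bytesWritten += size
		}
		s.buffered = nil
		s.dictClosed = true
	}
}

// bufferedBytes sums the sizes of the pages not yet counted as written.
func (s *toyPageSink) bufferedBytes() int64 {
	var total int64
	for _, size := range s.buffered {
		total += size
	}
	return total
}

func main() {
	s := &toyPageSink{dictLimit: 100}
	for i := 0; i < 5; i++ {
		s.addPage(1000, 30)
		fmt.Printf("page %d: written=%d buffered=%d\n", i+1, s.bytesWritten, s.bufferedBytes())
	}
}
```

Under these assumptions the written counter stays at 0 for the first three pages even though 3000 bytes of data pages already exist, which is the underestimate described above.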
   
   Perhaps related to: https://github.com/apache/arrow/issues/39789. 
   
   ### Component(s)
   
   Go

