[ 
https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394605#comment-17394605
 ] 

Gabor Szadovszky commented on PARQUET-2073:
-------------------------------------------

[~JiangYang], you're right, {{rowsToFillPage}} will always be zero. It means 
(because of [line 
256|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L256])
 that we never use this estimate correctly, so the next row count check will 
always step by {{props.getMinRowCountForPageSizeCheck()}}. Funny that it has 
been working this way ever since we introduced this estimation logic. Strange 
that no one has ever noticed.

As for fixing this issue, we can get proper results without casting:
{code:java}
rows * remainingMem / usedMem
{code}
However, this form is a bit misleading, so we need a comment explaining that we 
are calculating the estimated number of rows that can still be written to the 
page, based on the average size of the rows already written.
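
To make the difference concrete, here is a minimal sketch comparing the two evaluation orders. The divide-first form is only an assumption about the shape of the current expression. The class name, method names, and sample numbers are hypothetical and chosen for illustration. This is not the parquet-mr code itself.

```java
// Hypothetical stand-alone sketch; not the actual ColumnWriteStoreBase code.
final class RowEstimateSketch {

    // Assumed shape of the problematic expression: the integer division
    // happens first, so the result is truncated to 0 whenever rows < usedMem.
    static long divideFirst(long rows, long usedMem, long remainingMem) {
        return rows / usedMem * remainingMem;
    }

    // Form suggested in the comment: multiply first, then divide.
    // Equivalent to remainingMem / (usedMem / rows), i.e. the remaining
    // space divided by the average size of the rows written so far.
    static long multiplyFirst(long rows, long usedMem, long remainingMem) {
        return rows * remainingMem / usedMem;
    }

    public static void main(String[] args) {
        long rows = 100L;                   // rows written to the page so far
        long usedMem = 4096L;               // bytes buffered for those rows
        long pageSizeThreshold = 1048576L;  // 1 MiB, illustrative value only
        long remainingMem = pageSizeThreshold - usedMem;

        System.out.println(divideFirst(rows, usedMem, remainingMem));   // prints 0
        System.out.println(multiplyFirst(rows, usedMem, remainingMem)); // prints 25500
    }
}
```

With {{long}} operands the intermediate product stays far from overflow for realistic page sizes and row counts, so no cast to a floating-point type is needed.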

The tricky part is how to test it. This will be a new behavior of the page 
writing and we have never tested this properly. (Otherwise, we would have 
caught this issue.) Whether this approach works well depends heavily on the 
characteristics of the values. (For example, small values at the beginning and 
large ones later can cause this logic to overrun the maximum size of the page. 
However, the same can happen if the wrong values are used for 
{{min/maxRowCountForPageSizeCheck}}.)
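
The overrun scenario can be illustrated with a small worked example. This is a rough sketch under simplifying assumptions: fixed row sizes, no encoding or compression effects, and no cap from {{min/maxRowCountForPageSizeCheck}}. All names and numbers are hypothetical.

```java
// Hypothetical simulation; it ignores encoding, compression and the
// min/max row count caps that the real writer applies.
final class PageOverrunSketch {

    // Estimated number of additional rows that fit into the page, based on
    // the average size of the rows already written.
    static long estimateRowsToFillPage(long rows, long usedMem, long remainingMem) {
        return rows * remainingMem / usedMem;
    }

    // Bytes buffered when the next size check fires, assuming the early rows
    // all have one size and the rows written until that check have another.
    static long bufferedBytesAtNextCheck(long rowsWritten, long bytesPerEarlyRow,
                                         long bytesPerLateRow, long pageSizeThreshold) {
        long usedMem = rowsWritten * bytesPerEarlyRow;
        long remainingMem = pageSizeThreshold - usedMem;
        long rowsUntilNextCheck = estimateRowsToFillPage(rowsWritten, usedMem, remainingMem);
        return usedMem + rowsUntilNextCheck * bytesPerLateRow;
    }

    public static void main(String[] args) {
        long threshold = 10_000L; // illustrative page size threshold in bytes

        // Uniform 10-byte rows: the estimate lands exactly on the threshold.
        System.out.println(bufferedBytesAtNextCheck(100L, 10L, 10L, threshold));  // prints 10000

        // 10-byte rows first, 100-byte rows afterwards: the page overruns.
        System.out.println(bufferedBytesAtNextCheck(100L, 10L, 100L, threshold)); // prints 91000
    }
}
```

In the second case the estimate allows 900 more rows before the next check, so the page grows to roughly nine times the threshold. A test for the fix would probably need data with this kind of skew as well as uniform data.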

Sure, please, create a PR. I am happy to review.

> Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2073
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2073
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: JiangYang
>            Priority: Critical
>         Attachments: image-2021-08-05-14-37-51-299.png
>
>
> !image-2021-08-05-14-37-51-299.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
