[ https://issues.apache.org/jira/browse/PARQUET-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613773#comment-17613773 ]

ASF GitHub Bot commented on PARQUET-2199:
-----------------------------------------

songhuicheng opened a new pull request, #1004:
URL: https://github.com/apache/parquet-mr/pull/1004

   Parquet checks the block size after writing records to decide when it 
should flush. Because this check is relatively expensive, it estimates when 
the next check is due based on the record size, record count, etc.
   
   For small records (less than 1 byte each after compression), the average 
record size rounds to 0 under integer division. This causes an overflow when 
calculating the record count for the next block-size check, so the block size 
is checked after every record, which slows down writes.
   
   Fix the zero-record-size issue by computing the average record size as a 
float.
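   The failure mode and the fix can be sketched as follows. This is an 
illustrative example only, not the actual parquet-mr code; the class and 
variable names are invented, and the estimation formula is a simplified stand-in 
for the real `checkBlockSizeReached` logic:

```java
// Illustrative sketch -- not the actual parquet-mr implementation.
// Shows how integer division truncates a sub-byte average record size to 0,
// and how keeping the average as a float avoids the degenerate estimate.
public class BlockSizeCheckSketch {
    public static void main(String[] args) {
        long recordCount = 1_000_000L;       // records buffered so far
        long bufferedSize = 500_000L;        // bytes after compression: < 1 byte/record
        long blockSize = 128L * 1024 * 1024; // target block size (128 MiB)

        // Buggy: integer division truncates the average to 0, so any
        // "remaining bytes / average record size" estimate degenerates
        // (division by zero, or a guard that checks after every record).
        long avgRecordSizeInt = bufferedSize / recordCount; // == 0

        // Fixed: a float average stays non-zero for sub-byte records,
        // giving a sane estimate of when the next size check is due.
        float avgRecordSize = (float) bufferedSize / recordCount; // == 0.5
        long nextSizeCheck =
                recordCount + (long) ((blockSize - bufferedSize) / avgRecordSize);

        System.out.println("int avg = " + avgRecordSizeInt);
        System.out.println("float avg = " + avgRecordSize);
        System.out.println("next check after record #" + nextSizeCheck);
    }
}
```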
   

> checkBlockSizeReached zero record size perf issue
> -------------------------------------------------
>
>                 Key: PARQUET-2199
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2199
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 2.0.0
>            Reporter: Huicheng Song
>            Priority: Minor
>
> Parquet checks the block size after writing records to decide when it should 
> flush. Because this check is relatively expensive, it estimates when the next 
> check is due based on the record size, record count, etc.
> For small records (less than 1 byte each after compression), the average 
> record size rounds to 0 under integer division. This causes an overflow when 
> calculating the record count for the next block-size check, resulting in the 
> block size being checked after every record.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
