[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687809#comment-17687809 ]

ASF GitHub Bot commented on PARQUET-2242:
-----------------------------------------

xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104154913


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##########
@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
         LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
         flushRowGroupToStore();
         initStore();
-        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+        recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);

Review Comment:
   Yes, it is like the configs 'parquet.page.size.row.check.min' and 'parquet.page.size.row.check.max'.
   The row group check algorithm works like the page check algorithm when the config 'parquet.page.size.check.estimate' is set to true.
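
As a rough illustration of what this review comment describes, the sketch below sets the existing page-level check properties on a Hadoop Configuration and, next to them, an analogous row-group-level pair in the spirit of this PR. The block-level property names here are assumptions for illustration only, not confirmed names from the patch.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RowCheckConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Existing page-level knobs in parquet-mr that the reviewer refers to:
    conf.setInt("parquet.page.size.row.check.min", 100);       // rows before the first page size check
    conf.setInt("parquet.page.size.row.check.max", 10000);     // upper bound on rows between checks
    conf.setBoolean("parquet.page.size.check.estimate", true); // estimate-based check scheduling

    // Hypothetical row-group (block) counterparts in the spirit of this PR;
    // these property names are assumed for illustration:
    conf.setInt("parquet.block.size.row.check.min", 100);
    conf.setInt("parquet.block.size.row.check.max", 10000);
  }
}
{code}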





> record count for row group size check configurable
> --------------------------------------------------
>
>                 Key: PARQUET-2242
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2242
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: xjlem
>            Priority: Major
>
>  org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>  private void checkBlockSizeReached() throws IOException {
>     if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>       long memSize = columnStore.getBufferedSize();
>       long recordSize = memSize / recordCount;
>       // flush the row group if it is within ~2 records of the limit
>       // it is much better to be slightly under size than to be over at all
>       if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>         LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>         flushRowGroupToStore();
>         initStore();
>         recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>         this.lastRowGroupEndPos = parquetFileWriter.getPos();
>       } else {
>         recordCountForNextMemCheck = min(
>             max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>             recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>             );
>         LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>       }
>     }
>   } {code}
> In this code, if the block size is small (for example 8 MB) and the records are small for the first 100 rows but large afterwards, the writer can produce an oversized row group; in our real-world case it grew past 64 MB. So I think the record count used for the block size check should be configurable.
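
To make the reported scenario concrete, here is a small sketch of the scheduling arithmetic in the method above, assuming the parquet-mr defaults MINIMUM_RECORD_COUNT_FOR_CHECK = 100 and MAXIMUM_RECORD_COUNT_FOR_CHECK = 10000; the per-record sizes are made-up values used only to show how the estimate can drift.

{code:java}
public class RowGroupCheckDrift {
  public static void main(String[] args) {
    long min = 100;                              // MINIMUM_RECORD_COUNT_FOR_CHECK
    long max = 10_000;                           // MAXIMUM_RECORD_COUNT_FOR_CHECK
    long nextRowGroupSize = 8L * 1024 * 1024;    // 8 MB target block size

    // The first check fires after 100 records; suppose they are tiny (~10 bytes each).
    long recordCount = 100;
    long memSize = recordCount * 10;
    long recordSize = memSize / recordCount;     // estimated 10 bytes per record

    // The buffer is far below the target, so the writer only schedules the next
    // check, using the same formula as the else-branch above:
    long nextCheck = Math.min(
        Math.max(min, (recordCount + (long) (nextRowGroupSize / (float) recordSize)) / 2),
        recordCount + max);
    System.out.println("next size check at record " + nextCheck);  // prints 10100

    // If every record after the first 100 is ~10 KB, by the time that check runs
    // the buffer already holds roughly 100 MB, far past the 8 MB target.
    long bufferedAtNextCheck = memSize + (nextCheck - recordCount) * 10_240;
    System.out.println("buffered bytes at next check ~= " + bufferedAtNextCheck);
  }
}
{code}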



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
