[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687987#comment-17687987 ]
ASF GitHub Bot commented on PARQUET-2242:
-----------------------------------------
wgtmac commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104614696
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##########
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";
Review Comment:
Why fix this on an old branch but not on master?
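
For reference, a minimal sketch of how the new key from the diff above could be consumed, mirroring the pattern of the existing accessors for the page-size check keys; the wrapper class, method name, and default of 100 are assumptions for illustration, not part of the PR:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical accessor for the key added in the diff above; the class,
// the method name, and the default value of 100 are assumptions.
public class BlockSizeCheckConfig {
  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK =
      "parquet.block.size.row.check.min";

  public static int getMinRowCountForBlockSizeCheck(Configuration conf) {
    // Fall back to a fixed minimum when the job does not set the key.
    return conf.getInt(MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK, 100);
  }
}
{code}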
> record count for row group size check configurable
> ---------------------------------------------------
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: xjlem
> Priority: Major
>
> org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
> private void checkBlockSizeReached() throws IOException {
>   if (recordCount >= recordCountForNextMemCheck) {
>     // checking the memory size is relatively expensive, so let's not do it for every record.
>     long memSize = columnStore.getBufferedSize();
>     long recordSize = memSize / recordCount;
>     // flush the row group if it is within ~2 records of the limit;
>     // it is much better to be slightly under size than to be over at all
>     if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>       LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>       flushRowGroupToStore();
>       initStore();
>       recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>       this.lastRowGroupEndPos = parquetFileWriter.getPos();
>     } else {
>       recordCountForNextMemCheck = min(
>           max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>           recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>           );
>       LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>     }
>   }
> } {code}
> In this code, if the block size is small, for example 8 MB, and the first 100 records are small while the records after them are large, the writer produces an oversized row group; in our real scenario it exceeded 64 MB. The check at record 100 computes a tiny average record size and then schedules the next check up to MAXIMUM_RECORD_COUNT_FOR_CHECK (10,000) records ahead, so the large records accumulate unchecked in the meantime (the sketch below replays this arithmetic). So I think the record count for the block size check should be configurable.
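>
> Below is a minimal standalone sketch (not parquet-mr code) that replays the scheduling arithmetic above for this scenario, assuming the hard-coded 1.10.1 constants MINIMUM_RECORD_COUNT_FOR_CHECK = 100 and MAXIMUM_RECORD_COUNT_FOR_CHECK = 10000 and an 8 MB row group target:
> {code:java}
> // Replays checkBlockSizeReached()'s scheduling with an 8 MB target:
> // 100 records of ~10 bytes followed by records of ~4 KB each.
> public class RowGroupCheckSim {
>   public static void main(String[] args) {
>     final long MIN_CHECK = 100;      // MINIMUM_RECORD_COUNT_FOR_CHECK
>     final long MAX_CHECK = 10_000;   // MAXIMUM_RECORD_COUNT_FOR_CHECK
>     final long rowGroupSize = 8L * 1024 * 1024; // 8 MB target from the report
>     long memSize = 0, recordCount = 0, nextCheck = MIN_CHECK;
>     while (true) {
>       memSize += (recordCount < 100) ? 10 : 4_096; // small head, large tail
>       recordCount++;
>       if (recordCount >= nextCheck) {
>         long recordSize = memSize / recordCount;   // average skewed by the small head
>         if (memSize > rowGroupSize - 2 * recordSize) {
>           System.out.printf("flush at record %d with %.1f MB buffered (target 8 MB)%n",
>               recordCount, memSize / 1048576.0);
>           return;
>         }
>         // schedule the next check "halfway" to the projected limit,
>         // but no more than MAX_CHECK records ahead
>         nextCheck = Math.min(
>             Math.max(MIN_CHECK, (recordCount + (long) (rowGroupSize / (float) recordSize)) / 2),
>             recordCount + MAX_CHECK);
>       }
>     }
>   }
> }
> {code}
> Under these assumptions, the writer buffers roughly 39 MB before the check at record 10,100 fires, almost five times the 8 MB target, because the gap between checks is capped only by a fixed record count rather than a configurable one.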
--
This message was sent by Atlassian Jira
(v8.20.10#820010)