[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688257#comment-17688257 ]
ASF GitHub Bot commented on PARQUET-2242:
-----------------------------------------

xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1105253278

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##########
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   We use version 1.10, and I found that this issue has already been fixed in [PARQUET-1920](https://issues.apache.org/jira/browse/PARQUET-1920). Thanks for your review; this PR can be closed.

> record count for row group size check configurable
> ---------------------------------------------------
>
>                 Key: PARQUET-2242
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2242
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: xjlem
>            Priority: Major
>
> org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>   private void checkBlockSizeReached() throws IOException {
>     if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>       long memSize = columnStore.getBufferedSize();
>       long recordSize = memSize / recordCount;
>       // flush the row group if it is within ~2 records of the limit
>       // it is much better to be slightly under size than to be over at all
>       if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>         LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>         flushRowGroupToStore();
>         initStore();
>         recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>         this.lastRowGroupEndPos = parquetFileWriter.getPos();
>       } else {
>         recordCountForNextMemCheck = min(
>             max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>             recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>             );
>         LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>       }
>     }
>   }
> {code}
> In this code, if the configured block size is small (for example 8 MB) and the first 100 records are small while the records after them are large, the writer produces an oversized row group; in our real workload it grows beyond 64 MB. So I think the record count used for the block size check should be configurable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
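For illustration, a minimal sketch of how the keys proposed in PR #1024 could be read from the Hadoop Configuration, in the same style as the existing parquet.page.size.row.check.* options. The helper class, the MAX_ROW_COUNT_FOR_BLOCK_SIZE_CHECK key (not visible in the quoted diff hunk), and the defaults of 100 and 10,000 (assumed to match the hard-coded MINIMUM/MAXIMUM_RECORD_COUNT_FOR_CHECK constants in InternalParquetRecordWriter) are assumptions, not the actual parquet-mr API.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper, not part of parquet-mr: shows how the configuration
// keys proposed in PR #1024 could be consumed by the writer.
public final class BlockSizeCheckOptions {

  // Key shown in the PR diff above.
  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK =
      "parquet.block.size.row.check.min";
  // Assumed companion key, mirroring parquet.page.size.row.check.max.
  public static final String MAX_ROW_COUNT_FOR_BLOCK_SIZE_CHECK =
      "parquet.block.size.row.check.max";

  // Defaults assumed to match the hard-coded constants used by
  // InternalParquetRecordWriter#checkBlockSizeReached.
  private static final int DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = 100;
  private static final int DEFAULT_MAX_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = 10000;

  private BlockSizeCheckOptions() {}

  public static int getMinRowCountForBlockSizeCheck(Configuration conf) {
    return conf.getInt(MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK,
        DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK);
  }

  public static int getMaxRowCountForBlockSizeCheck(Configuration conf) {
    return conf.getInt(MAX_ROW_COUNT_FOR_BLOCK_SIZE_CHECK,
        DEFAULT_MAX_ROW_COUNT_FOR_BLOCK_SIZE_CHECK);
  }
}
{code}

With options like these, checkBlockSizeReached could clamp recordCountForNextMemCheck to configured values instead of compile-time constants, so a job writing small row groups (e.g. 8 MB) could lower the minimum below 100 records and avoid the oversized first row group described in the issue.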