[jira] [Commented] (PARQUET-2242) record count for row group size check configurable
[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688257#comment-17688257 ]

ASF GitHub Bot commented on PARQUET-2242:

xjlem commented on code in PR #1024: URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1105253278

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment: Because we use version 1.10, and I found that this issue has already been fixed by [PARQUET-1920](https://issues.apache.org/jira/browse/PARQUET-1920). Thanks for your review; this PR can be closed.

> record count for row group size check configurable
> ---
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: xjlem
> Priority: Major
>
> org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
> private void checkBlockSizeReached() throws IOException {
>   if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>     long memSize = columnStore.getBufferedSize();
>     long recordSize = memSize / recordCount;
>     // flush the row group if it is within ~2 records of the limit
>     // it is much better to be slightly under size than to be over at all
>     if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>       LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>       flushRowGroupToStore();
>       initStore();
>       recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>       this.lastRowGroupEndPos = parquetFileWriter.getPos();
>     } else {
>       recordCountForNextMemCheck = min(
>           max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>           recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>       );
>       LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>     }
>   }
> } {code}
> In this code, if the block size is small (for example 8 MB) and the first 100 records are small while the records after them are large, it will produce an oversized row group; in our real-world scenario it exceeded 64 MB. So I think the record count for the block size check should be configurable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
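The scheduling logic quoted in the issue can be simulated in isolation to see the failure mode. Below is a minimal, hypothetical sketch (the class and method names are ours, not parquet-mr's; the two constants mirror the defaults of 100 and 10000 in `InternalParquetRecordWriter`) showing how 100 tiny records against an 8 MB row group push the next memory check out by the maximum look-ahead, so large records arriving afterwards can blow far past the limit before anything is flushed:

```java
// Hypothetical simulation of the next-check scheduling in checkBlockSizeReached.
public class RowGroupCheckSim {
    // Defaults in parquet-mr's InternalParquetRecordWriter.
    static final long MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
    static final long MAXIMUM_RECORD_COUNT_FOR_CHECK = 10_000;

    // Record count at which memory will be checked next, given the state seen
    // at the current check (mirrors the "else" branch of the quoted code).
    static long nextCheck(long recordCount, long recordSize, long rowGroupSize) {
        long estimatedFull = recordCount + (long) (rowGroupSize / (float) recordSize);
        return Math.min(
            Math.max(MINIMUM_RECORD_COUNT_FOR_CHECK, estimatedFull / 2), // check halfway to the estimate
            recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK);               // but look at most max records ahead
    }

    public static void main(String[] args) {
        // 100 records of ~10 bytes each against an 8 MB row group: the next
        // check lands at record 10100 (capped by the max look-ahead). If the
        // following 10000 records are ~10 KB each, ~100 MB is buffered before
        // the writer looks at memory again.
        System.out.println(nextCheck(100, 10, 8L << 20));
    }
}
```

With large records the estimate self-corrects (e.g. 10 KB records yield a next check at record 469), which is why the problem only bites when the first sampled records are unrepresentatively small.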
[GitHub] [parquet-mr] xjlem closed pull request #1024: PARQUET-2242: record count for row group size check configurable
xjlem closed pull request #1024: PARQUET-2242: record count for row group size check configurable
URL: https://github.com/apache/parquet-mr/pull/1024

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls
[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688231#comment-17688231 ]

ASF GitHub Bot commented on PARQUET-2241:

wgtmac commented on code in PR #1025: URL: https://github.com/apache/parquet-mr/pull/1025#discussion_r1105198239

## parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java:
@@ -58,15 +58,19 @@ protected void gatherElementDataFromStreams(byte[] gatheredData)
   public void initFromPage(int valuesCount, ByteBufferInputStream stream) throws ParquetDecodingException, IOException {

Review Comment: Added a check to use `valuesCount` as upper bound.

> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format, parquet-mr
> Affects Versions: format-2.8.0
> Reporter: Xuwei Fu
> Assignee: Gang Wu
> Priority: Major
>
> This problem is shown in this issue: https://github.com/apache/arrow/issues/15173
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, but it uses "num_values" as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2 page, the decoder can know num_values and num_nulls from the page header; for DATA_PAGE_V1, without statistics, it has to read the def-levels and rep-levels to get the real number of values. Without that number, it cannot decode BYTE_STREAM_SPLIT correctly.
>
> The bug-reproducing code is in the issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
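To see why the decoder must know the true (non-null) value count, here is a minimal, self-contained sketch of BYTE_STREAM_SPLIT for 4-byte floats (our own illustration, not parquet-mr's implementation): the value count n is the stride between the byte streams, so decoding with the wrong n (e.g. a page-level count that includes nulls) scrambles every value.

```java
// Hypothetical sketch of BYTE_STREAM_SPLIT for FLOAT (4-byte) values.
public class ByteStreamSplitSketch {
    // Encode: byte i of each value goes into stream i; the n streams are
    // laid out back to back, so n is implicitly the stride between streams.
    static byte[] encode(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * 4];
        for (int j = 0; j < n; j++) {
            int bits = Float.floatToIntBits(values[j]);
            for (int i = 0; i < 4; i++) {
                out[i * n + j] = (byte) (bits >>> (8 * i));
            }
        }
        return out;
    }

    // Decode needs the true value count n: it is the only way to find where
    // each byte stream starts. An inflated n (nulls counted in) garbles output.
    static float[] decode(byte[] data, int n) {
        float[] out = new float[n];
        for (int j = 0; j < n; j++) {
            int bits = 0;
            for (int i = 0; i < 4; i++) {
                bits |= (data[i * n + j] & 0xff) << (8 * i);
            }
            out[j] = Float.intBitsToFloat(bits);
        }
        return out;
    }
}
```

This is exactly the DATA_PAGE_V1 problem from the issue: without statistics, the non-null count is only recoverable by decoding the definition levels first.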
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688227#comment-17688227 ]

ASF GitHub Bot commented on PARQUET-2228:

wgtmac commented on code in PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105193381

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }

   // Open all input files to validate their schemas are compatible to merge
   private void openInputFiles(List<Path> inputFiles, Configuration conf) {
     Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");

     for (Path inputFile : inputFiles) {

Review Comment: It is tricky to set a max size, whether in terms of the number of files or the total file size. Based on my knowledge, performance varies with the input files' number of rows, number of columns, number of blocks, block sizes, etc. So it would be better for the caller to do the planning.

> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this task is to support multiple input files; the rewriter merges them into a single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688224#comment-17688224 ]

ASF GitHub Bot commented on PARQUET-2228:

wgtmac commented on code in PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105191438

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }

   // Open all input files to validate their schemas are compatible to merge
   private void openInputFiles(List<Path> inputFiles, Configuration conf) {
     Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");

     for (Path inputFile : inputFiles) {
       try {
         TransParquetFileReader reader = new TransParquetFileReader(
             HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
         MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
         if (this.schema == null) {
           this.schema = inputFileSchema;
         } else {
           // Now we enforce equality of schemas from input files for simplicity.
           if (!this.schema.equals(inputFileSchema)) {
             throw new InvalidSchemaException("Input files have different schemas");
           }
         }
         this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
         this.inputFiles.add(reader);
       } catch (IOException e) {
         throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
       }
     }

     extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
> Do we do dedup?

Yes, `allOriginalCreatedBys` is a HashSet which does the job.

> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this task is to support multiple input files; the rewriter merges them into a single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688043#comment-17688043 ]

ASF GitHub Bot commented on PARQUET-758:

julienledem commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1428363226

@emkornfield apologies for this, I realize I replied to a thread on the private list. @pitrou emailed private@ to get more eyes on this. Which worked. But yes, this discussion didn't need to be private. I replied: "I would recommend having the corresponding PRs for support in the Java and C++ implementation ready as well for this to be merged so that we don't release a new type that does not have support in the implementation."

> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Julien Le Dem
> Priority: Minor

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
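For context on the type being proposed: an IEEE 754 half-precision float is 16 bits wide (1 sign bit, 5 exponent bits, 10 mantissa bits). A minimal decoding sketch (our own illustration, independent of the PR and of any Parquet API; `Float16Sketch` and `halfToFloat` are hypothetical names) shows how little logic the new logical type actually requires on the read path:

```java
// Hypothetical half-precision (binary16) decoder sketch.
public class Float16Sketch {
    // Decode IEEE 754 binary16 bits (low 16 bits of h) to a Java float.
    static float halfToFloat(int h) {
        int sign = (h >>> 15) & 0x1;
        int exp  = (h >>> 10) & 0x1f;
        int mant = h & 0x3ff;
        float val;
        if (exp == 0x1f) {                           // all-ones exponent: Inf or NaN
            val = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else if (exp == 0) {                       // zero / subnormal: mant * 2^-24
            val = mant * 0x1p-24f;
        } else {                                     // normal: (1024 + mant) * 2^(exp - 25)
            val = (0x400 | mant) * Math.scalb(1f, exp - 25);
        }
        return sign == 1 ? -val : val;
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat(0x3C00)); // 1.0
        System.out.println(halfToFloat(0x7BFF)); // 65504.0, the largest finite half
    }
}
```

Every binary16 value converts exactly to a binary32 float, so a reader can widen losslessly; the interesting design questions in the PR are on the format side (sort order, statistics), not the arithmetic.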
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688030#comment-17688030 ]

ASF GitHub Bot commented on PARQUET-2228:

shangxinli commented on code in PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104769279

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }

   // Open all input files to validate their schemas are compatible to merge
   private void openInputFiles(List<Path> inputFiles, Configuration conf) {
     Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");

     for (Path inputFile : inputFiles) {

Review Comment: Do we want to set a max size?

> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this task is to support multiple input files; the rewriter merges them into a single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688029#comment-17688029 ]

ASF GitHub Bot commented on PARQUET-2228:

shangxinli commented on code in PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104767997

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }

   // Open all input files to validate their schemas are compatible to merge
   private void openInputFiles(List<Path> inputFiles, Configuration conf) {
     Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");

     for (Path inputFile : inputFiles) {
       try {
         TransParquetFileReader reader = new TransParquetFileReader(
             HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
         MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
         if (this.schema == null) {
           this.schema = inputFileSchema;
         } else {
           // Now we enforce equality of schemas from input files for simplicity.
           if (!this.schema.equals(inputFileSchema)) {
             throw new InvalidSchemaException("Input files have different schemas");
           }
         }
         this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
         this.inputFiles.add(reader);
       } catch (IOException e) {
         throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
       }
     }

     extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment: Do we do dedup?

> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this task is to support multiple input files; the rewriter merges them into a single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls
[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688027#comment-17688027 ]

ASF GitHub Bot commented on PARQUET-2241:

shangxinli commented on code in PR #1025: URL: https://github.com/apache/parquet-mr/pull/1025#discussion_r1104758035

## parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java:
@@ -58,15 +58,19 @@ protected void gatherElementDataFromStreams(byte[] gatheredData)
   public void initFromPage(int valuesCount, ByteBufferInputStream stream) throws ParquetDecodingException, IOException {

Review Comment: It seems the input parameter `valuesCount` is not used anymore.

> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format, parquet-mr
> Affects Versions: format-2.8.0
> Reporter: Xuwei Fu
> Assignee: Gang Wu
> Priority: Major
>
> This problem is shown in this issue: https://github.com/apache/arrow/issues/15173
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, but it uses "num_values" as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2 page, the decoder can know num_values and num_nulls from the page header; for DATA_PAGE_V1, without statistics, it has to read the def-levels and rep-levels to get the real number of values. Without that number, it cannot decode BYTE_STREAM_SPLIT correctly.
>
> The bug-reproducing code is in the issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688001#comment-17688001 ]

ASF GitHub Bot commented on PARQUET-2228:

wgtmac commented on PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1428175723

@ggershinsky @shangxinli PTAL, thanks!

> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this task is to support multiple input files; the rewriter merges them into a single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2242) record count for row group size check configurable
[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687987#comment-17687987 ]

ASF GitHub Bot commented on PARQUET-2242:

wgtmac commented on code in PR #1024: URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104614696

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment: Why fix this on an old branch but not on the master?

> record count for row group size check configurable
> ---
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: xjlem
> Priority: Major
>
> org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
> private void checkBlockSizeReached() throws IOException {
>   if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>     long memSize = columnStore.getBufferedSize();
>     long recordSize = memSize / recordCount;
>     // flush the row group if it is within ~2 records of the limit
>     // it is much better to be slightly under size than to be over at all
>     if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>       LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>       flushRowGroupToStore();
>       initStore();
>       recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>       this.lastRowGroupEndPos = parquetFileWriter.getPos();
>     } else {
>       recordCountForNextMemCheck = min(
>           max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>           recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>       );
>       LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>     }
>   }
> } {code}
> In this code, if the block size is small (for example 8 MB) and the first 100 records are small while the records after them are large, it will produce an oversized row group; in our real-world scenario it exceeded 64 MB. So I think the record count for the block size check should be configurable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687985#comment-17687985 ] ASF GitHub Bot commented on PARQUET-2228: - wgtmac commented on code in PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104610792

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:

@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,

+  // Open all input files to validate that their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // For now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment: We have discussed the issue of consolidating the original created_by values when merging several input files. I now use a HashSet to make sure there are no duplicates, and the values are concatenated by `\n`. Does this sound good? @gszadovszky

> ParquetRewriter supports more than one input file
> -------------------------------------------------
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this
> task is to support multiple input files and the rewriter merges them into a
> single one w/o some rewrite options specified.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
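The consolidation described in the review comment can be sketched independently of Parquet (a hypothetical standalone class, not the PR's actual code): a set drops duplicate created_by strings from the input footers, and the survivors are joined with `\n` before being stored as extra metadata.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class CreatedByMerge {
  // Deduplicate created_by values and join them with '\n', as discussed above.
  // LinkedHashSet keeps first-seen order; the PR itself mentions a HashSet.
  public static String consolidate(Iterable<String> createdBys) {
    Set<String> unique = new LinkedHashSet<>();
    for (String createdBy : createdBys) {
      unique.add(createdBy);
    }
    return String.join("\n", unique);
  }

  public static void main(String[] args) {
    System.out.println(consolidate(Arrays.asList(
        "parquet-mr version 1.12.0",
        "parquet-mr version 1.12.0",  // duplicate collapses into one entry
        "parquet-mr version 1.10.1")));
  }
}
```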
[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687977#comment-17687977 ] ASF GitHub Bot commented on PARQUET-2228: - wgtmac opened a new pull request, #1026: URL: https://github.com/apache/parquet-mr/pull/1026

### Jira
https://issues.apache.org/jira/browse/PARQUET-2228

### Tests
- Refactor and add various cases to `ParquetRewriterTest` for merging files.

### Commits
- RewriteOptions supports merging more than one input file.
- Enforce schema equality of all input files.
- ParquetRewriter supports various rewrite operations on several input files.
[GitHub] [parquet-mr] wgtmac opened a new pull request, #1026: PARQUET-2228: ParquetRewriter supports more than one input file
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687886#comment-17687886 ] ASF GitHub Bot commented on PARQUET-758: pitrou commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1427745167

> I might have missed it, but I didn't see Julien's reply on the dev mailing list. This seems reasonable though.

For full disclosure, it was a discussion involving the Parquet PMC and myself after I messaged the PMC to inquire about the way forward on this proposal.

> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Julien Le Dem
> Priority: Minor
[GitHub] [parquet-format] pitrou commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type
[jira] [Commented] (PARQUET-2242) record count for row group size check configurable
[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687809#comment-17687809 ] ASF GitHub Bot commented on PARQUET-2242: - xjlem commented on code in PR #1024: URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104154913

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:

@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
   LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
   flushRowGroupToStore();
   initStore();
-  recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+  recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);

Review Comment: Yes, it's like the configs 'parquet.page.size.row.check.min' and 'parquet.page.size.row.check.max'. The row group check algorithm is like the page check algorithm with 'parquet.page.size.check.estimate' set to true.

> record count for row group size check configurable
> --------------------------------------------------
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: xjlem
> Priority: Major
>
> org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
> private void checkBlockSizeReached() throws IOException {
>   if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>     long memSize = columnStore.getBufferedSize();
>     long recordSize = memSize / recordCount;
>     // flush the row group if it is within ~2 records of the limit
>     // it is much better to be slightly under size than to be over at all
>     if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>       LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>       flushRowGroupToStore();
>       initStore();
>       recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>       this.lastRowGroupEndPos = parquetFileWriter.getPos();
>     } else {
>       recordCountForNextMemCheck = min(
>           max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>           recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>       );
>       LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>     }
>   }
> }
> {code}
> In this code, if the block size is small (for example 8M) and the first 100 records are small but the records after that are big, it will produce a big row group; in our real-world scenario it grew to more than 64M. So I think the record count for the block size check should be configurable.
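The overshoot the reporter describes can be made concrete with a small standalone simulation (a hypothetical class; 100 is assumed to be the writer's hard-coded `MINIMUM_RECORD_COUNT_FOR_CHECK` in the 1.10.x line): if the first 90 records are ~100 bytes but the next ones are ~1 MB, more than 10 MB is already buffered before the first size check of an 8 MB row group ever runs.

```java
public class RowGroupOvershoot {
  // The writer only measures buffered memory every so many records; before the
  // first check it has accepted MIN_RECORD_COUNT_FOR_CHECK records blindly.
  static final long MIN_RECORD_COUNT_FOR_CHECK = 100;

  // Total bytes buffered when the first size check finally happens, given
  // `smallRecords` records of `smallSize` bytes followed by records of `bigSize` bytes.
  public static long bufferedAtFirstCheck(long smallRecords, long smallSize, long bigSize) {
    long buffered = 0;
    for (long i = 0; i < MIN_RECORD_COUNT_FOR_CHECK; i++) {
      buffered += (i < smallRecords) ? smallSize : bigSize;
    }
    return buffered;
  }

  public static void main(String[] args) {
    long targetRowGroup = 8L * 1024 * 1024;                      // 8 MB target block size
    long buffered = bufferedAtFirstCheck(90, 100, 1024 * 1024);  // 90 tiny + 10 big records
    System.out.println(buffered + " bytes buffered vs " + targetRowGroup + " byte target");
  }
}
```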
[GitHub] [parquet-mr] xjlem commented on a diff in pull request #1024: PARQUET-2242:record count for row group size check configurable
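The adaptive scheduling in the `else` branch of `checkBlockSizeReached` can be isolated as a standalone helper (a hypothetical sketch mirroring the expression quoted in the issue, with the hard-coded constants replaced by the configurable min/max the PR proposes):

```java
public class NextCheckHeuristic {
  // Next record count at which to re-measure buffered memory: roughly halfway
  // to the projected row-group-full point, clamped to [minCheck, recordCount + maxCheck].
  public static long nextCheck(long recordCount, long memSize, long nextRowGroupSize,
                               long minCheck, long maxCheck) {
    long recordSize = memSize / recordCount;  // average bytes per record so far
    return Math.min(
        Math.max(minCheck, (recordCount + (long) (nextRowGroupSize / (float) recordSize)) / 2),
        recordCount + maxCheck);              // never look more than maxCheck records ahead
  }

  public static void main(String[] args) {
    // 100 records of ~100 bytes buffered against an 8 MB row group: the halfway
    // projection is ~41993 records, clamped down to 100 + 10000 = 10100.
    System.out.println(nextCheck(100, 10_000, 8L * 1024 * 1024, 100, 10_000));
  }
}
```

The clamp is exactly why the reporter wants the constants configurable: with a small row group and an unrepresentative early sample, even the clamped look-ahead can let the buffer blow past the target.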
[jira] [Commented] (PARQUET-2242) record count for row group size check configurable
[ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687805#comment-17687805 ] ASF GitHub Bot commented on PARQUET-2242: - xjlem commented on code in PR #1024: URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104147184

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:

@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment: [apache:parquet-1.10.x](https://github.com/apache/parquet-mr/tree/parquet-1.10.x) doesn't have a README.md in the parquet-hadoop directory, so I have now added one in this branch.
[GitHub] [parquet-mr] xjlem commented on a diff in pull request #1024: PARQUET-2242:record count for row group size check configurable