[ https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782491#comment-17782491 ]
ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------

wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1381380912


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########

@@ -722,14 +722,7 @@ public void writeDataPage(
     LOG.debug("{}: write data page content {}", out.getPos(), compressedPageSize);
     bytes.writeAllTo(out);

-    // Copying the statistics if it is not initialized yet so we have the correct typed one
-    if (currentStatistics == null) {
-      currentStatistics = statistics.copy();
-    } else {
-      currentStatistics.mergeStatistics(statistics);
-    }
-
-    columnIndexBuilder.add(statistics);
+    mergeColumnStatistics(statistics);

Review Comment:
   Do we need the same fix here and below: https://github.com/apache/parquet-mr/blob/514cc6c257fe8e618b100a19d86d304f6442cb94/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L215?

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########

@@ -392,6 +402,7 @@ private void processChunk(ColumnChunkMetaData chunk,
     DictionaryPage dictionaryPage = null;
     long readValues = 0;
     Statistics<?> statistics = null;
+    boolean needOverwriteColumnStatistics = false;

Review Comment:
   ```suggestion
       boolean isColumnStatisticsMalformed = false;
   ```

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########

@@ -324,15 +324,20 @@ private void processBlocksFromReader(IndexCache indexCache) throws IOException {
         // Translate compression and/or encryption
         writer.startColumn(descriptor, crStore.getColumnReader(descriptor).getTotalValueCount(), newCodecName);
-        processChunk(
+        boolean needOverwriteStatistics = processChunk(
             chunk,
             newCodecName,
             columnChunkEncryptorRunTime,
             encryptColumn,
             indexCache.getBloomFilter(chunk),
             indexCache.getColumnIndex(chunk),
             indexCache.getOffsetIndex(chunk));
-        writer.endColumn();
+        if (needOverwriteStatistics) {
+          // All the page statistics are invalid, so we need to overwrite the column statistics
+          writer.endColumn(chunk.getStatistics());

Review Comment:
   Is it cleaner to handle this inside processChunk()? Then we don't have to bother with endColumn().

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########

@@ -988,6 +988,21 @@ void writeColumnChunk(ColumnDescriptor descriptor,
     endColumn();
   }

+  /**
+   * Overwrite the column total statistics. This is used when the column total statistics
+   * is known while all the page statistics are invalid, for example when rewriting the column.
+   *
+   * @param totalStatistics the column total statistics
+   * @throws IOException if there is an error while writing
+   */
+  public void endColumn(Statistics<?> totalStatistics) throws IOException {

Review Comment:
   Since this is used only for invalid stats, is it better to add a `void invalidateStatistics()` which simply includes lines 988 and 990? Then you just need to call `invalidateStatistics()` and `endColumn()` in this case.
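To make the last suggestion concrete, here is a minimal sketch of what such an `invalidateStatistics()` helper in `ParquetFileWriter` could look like. The statements on lines 988 and 990 of the PR are not visible in this thread, so the body below is an assumption built from the writer's existing `currentStatistics` and `columnIndexBuilder` fields, not the committed implementation:

```java
// Sketch only: assumes ParquetFileWriter's existing fields `currentStatistics`
// (org.apache.parquet.column.statistics.Statistics) and `columnIndexBuilder`
// (org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder).

/**
 * Replaces the accumulated (malformed) page-level statistics with the
 * caller-supplied column totals, so the next endColumn() call serializes
 * valid chunk metadata.
 */
public void invalidateStatistics(Statistics<?> totalStatistics) {
  Preconditions.checkArgument(totalStatistics != null, "Column total statistics can not be null");
  currentStatistics = totalStatistics;
  // Assumption: also drop the column index, since its per-page stats are unusable.
  columnIndexBuilder = ColumnIndexBuilder.getNoOpBuilder();
}
```

The call site in ParquetRewriter would then read (using the `isColumnStatisticsMalformed` rename suggested above) roughly:

```java
if (isColumnStatisticsMalformed) {
  writer.invalidateStatistics(chunk.getStatistics());
}
writer.endColumn();
```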
> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
>                 Key: PARQUET-2365
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2365
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The ColumnIndex can be null in some cases, for example when a float/double
> column contains NaN or when its size has exceeded the expected value. Page
> header statistics are also no longer written now that ColumnIndex is
> supported. So rewriting a column without a ColumnIndex throws an NPE, because
> the page statistics we need are null whether they are converted from the
> (null) ColumnIndex or read from the (null) page header. For example:
> ```java
> java.lang.NullPointerException
>     at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
>     at org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
>     at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)