[ 
https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782433#comment-17782433
 ] 

ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------

ConeyLiu commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1381213792


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -1317,6 +1324,27 @@ private int toIntWithCheck(long size) {
     return (int)size;
   }
 
+  private void mergeColumnStatistics(Statistics<?> statistics) {
+    if (!currentStatisticsAreValid) {

Review Comment:
   Do we support the case where the input statistics are empty? In that case, 
the `currentStatistics` will be empty after merging the first statistics. Then 
we have no idea whether the statistics are valid or not.





> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
>                 Key: PARQUET-2365
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2365
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The ColumnIndex could be null in some scenes, for example, the float/double 
> column contains NaN or the size has exceeded the expected value. And the page 
> header statistics are not written anymore after we supported ColumnIndex. So 
> we will get NPE when rewriting the column without ColumnIndex due to we need 
> to get NULL page statistics when converted from the ColumnIndex(NULL) or page 
> header statistics(NULL). Such as the following:
> ```java
> java.lang.NullPointerException
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
>       at 
> org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
>       at 
> org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
>       at 
> org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to