[ 
https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779550#comment-17779550
 ] 

ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------

wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1371930026


##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -543,6 +546,11 @@ public static ColumnIndex build(
    *          the statistics to be added
    */
   public void add(Statistics<?> stats) {
+    if (stats.isEmpty()) {

Review Comment:
   This is used by the rewriter to rebuild the ColumnIndex. The change lets 
it invalidate the ColumnIndex if any of the statistics are invalid, right? It is 
also used by the general file writer. Does the change affect its behavior? If so, 
we probably need to add a test case.
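
To make the question concrete, here is a minimal, self-contained sketch of the behavior under discussion. It does not use the real parquet-mr classes: `PageStats` and `SketchIndexBuilder` are hypothetical stand-ins, and it assumes that a single empty statistics object invalidates the whole index so that `build()` returns null. A test case for the general writer path could assert along these lines.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnIndexSketch {
  /** Hypothetical stand-in for page statistics; empty means no min/max was recorded. */
  static final class PageStats {
    final Integer min;
    final Integer max;

    PageStats(Integer min, Integer max) {
      this.min = min;
      this.max = max;
    }

    boolean isEmpty() {
      return min == null || max == null;
    }
  }

  /** Hypothetical builder (not the parquet-mr one): one empty page invalidates the index. */
  static final class SketchIndexBuilder {
    private final List<PageStats> pages = new ArrayList<>();
    private boolean invalid = false;

    void add(PageStats stats) {
      if (stats.isEmpty()) {
        // Assumed behavior: remember that the index can no longer be trusted.
        invalid = true;
        return;
      }
      pages.add(stats);
    }

    /** Returns null when the index was invalidated or no page was added. */
    List<PageStats> build() {
      return (invalid || pages.isEmpty()) ? null : pages;
    }
  }

  public static void main(String[] args) {
    SketchIndexBuilder builder = new SketchIndexBuilder();
    builder.add(new PageStats(1, 5));
    builder.add(new PageStats(null, null)); // page without usable statistics
    System.out.println("index built: " + (builder.build() != null));
  }
}
```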



##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -543,6 +546,11 @@ public static ColumnIndex build(
    *          the statistics to be added
    */
   public void add(Statistics<?> stats) {
+    if (stats.isEmpty()) {

Review Comment:
   From what you have said, it seems that the problem comes from the input file 
which has valid aggregate statistics for the column chunk but does not write 
page statistics in the page header. Should we just fix the NPE in the page 
header and leave other parts as is?
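
As an illustration of that narrower fix, the sketch below guards only the point where missing page statistics would be dereferenced. The names `PageStats` and `describeHeaderStats` are hypothetical and are not part of ParquetFileWriter; the sketch only assumes that page statistics may legitimately be null.

```java
import java.util.Optional;

public class PageHeaderStatsSketch {
  /** Hypothetical stand-in for page statistics read from the input file. */
  static final class PageStats {
    final long nullCount;

    PageStats(long nullCount) {
      this.nullCount = nullCount;
    }
  }

  /**
   * Describes what would go into the page header. Missing statistics are
   * reported as absent instead of being dereferenced.
   */
  static String describeHeaderStats(PageStats stats) {
    return Optional.ofNullable(stats)
        .map(s -> "nullCount=" + s.nullCount)
        .orElse("no statistics");
  }

  public static void main(String[] args) {
    System.out.println(describeHeaderStats(new PageStats(3)));
    System.out.println(describeHeaderStats(null)); // no NPE for a page without statistics
  }
}
```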



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -988,6 +988,21 @@ void writeColumnChunk(ColumnDescriptor descriptor,
     endColumn();
   }
 
+  /**
+   * Overwrite the column total statistics. This special used when the column total statistics
+   * is known while all the page statistics are invalid, for example when rewriting the column.
+   *
+   * @param totalStatistics the column total statistics
+   * @throws IOException if there is an error while writing
+   */
+  public void endColumn(Statistics<?> totalStatistics) throws IOException {
+    Preconditions.checkArgument(totalStatistics != null, "Column total statistics can not be null");
+    currentStatistics = totalStatistics;
+    // Invalid the ColumnIndex
+    columnIndexBuilder = ColumnIndexBuilder.getNoOpBuilder();

Review Comment:
   This looks hacky, TBH.





> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
>                 Key: PARQUET-2365
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2365
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The ColumnIndex can be null in some cases, for example when a float/double 
> column contains NaN or the size exceeds the expected value. In addition, page 
> header statistics are no longer written since ColumnIndex support was added. 
> As a result, rewriting a column without a ColumnIndex throws an NPE, because 
> the page statistics are null whether they are converted from the ColumnIndex 
> (null) or read from the page header (null). For example:
> ```java
> java.lang.NullPointerException
>       at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
>       at org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
>       at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
>       at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
>       at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
>       at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
