[ https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777160#comment-17777160 ]
ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------

ConeyLiu commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1365223686


##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -543,6 +546,11 @@ public static ColumnIndex build(
    * the statistics to be added
    */
   public void add(Statistics<?> stats) {
+    if (stats.isEmpty()) {

Review Comment:
   The problem happens when both the `ColumnIndex` and the page header `Statistics` are null, because `convertStatistics` then returns `null`. However, `ParquetFileWriter.writeDataPage` requires page statistics. So here we pass invalid page statistics to avoid the NPE and overwrite the column statistics at the end. Otherwise, we would need to add methods that do not require page statistics.

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -612,13 +612,13 @@ public void writeDataPage(
    * @throws IOException if any I/O error occurs during writing the file
    */
   public void writeDataPage(
-      int valueCount, int uncompressedPageSize,
-      BytesInput bytes,
-      Statistics<?> statistics,
-      long rowCount,
-      Encoding rlEncoding,
-      Encoding dlEncoding,
-      Encoding valuesEncoding) throws IOException {
+    int valueCount, int uncompressedPageSize,
+    BytesInput bytes,
+    Statistics<?> statistics,
+    long rowCount,
+    Encoding rlEncoding,
+    Encoding dlEncoding,
+    Encoding valuesEncoding) throws IOException {

Review Comment:
   Sure, will revert it.


> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
>                 Key: PARQUET-2365
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2365
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The ColumnIndex can be null in some cases, for example, when a float/double
> column contains NaN or when the index size exceeds the expected limit. In
> addition, the page header statistics are no longer written since ColumnIndex
> support was added. So we get an NPE when rewriting a column without a
> ColumnIndex, because the page statistics converted from the ColumnIndex (null)
> or from the page header statistics (null) are null. For example:
> ```java
> java.lang.NullPointerException
> 	at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
> 	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
> 	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
> 	at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
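
As a reading aid for the first review comment above, the sketch below shows the kind of fallback it describes: an empty `Statistics` object (rather than `null`) is handed to `writeDataPage` when neither the `ColumnIndex` nor the page header provides statistics, and the real column statistics are written at the end. This is only a minimal illustration under assumptions, not the change in PR #1173; the helper `pageStatisticsOrEmpty` and its placement are hypothetical, while `Statistics.createStats` and `Statistics.isEmpty()` are existing parquet-column APIs.

```java
// Minimal sketch, NOT the actual patch from PR #1173.
// Assumption: Statistics.createStats(type) yields an empty Statistics
// instance (isEmpty() == true); the helper and its name are hypothetical.
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.schema.PrimitiveType;

final class PageStatisticsFallback {

  // Choose page statistics for writeDataPage. If both the ColumnIndex-derived
  // statistics and the page header statistics are missing, return an empty
  // Statistics object instead of null so writeDataPage does not throw an NPE.
  static Statistics<?> pageStatisticsOrEmpty(
      Statistics<?> fromColumnIndex,
      Statistics<?> fromPageHeader,
      PrimitiveType type) {
    if (fromColumnIndex != null) {
      return fromColumnIndex;
    }
    if (fromPageHeader != null) {
      return fromPageHeader;
    }
    // The empty placeholder carries no values, so a guard such as the
    // stats.isEmpty() check added to ColumnIndexBuilder.add(...) can skip it,
    // and the column-level statistics written at the end stay correct.
    return Statistics.createStats(type);
  }
}
```

Keeping the fallback on the caller side corresponds to the first option in the review comment; the alternative mentioned there would be adding `writeDataPage` variants that do not take page statistics at all.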