[
https://issues.apache.org/jira/browse/PARQUET-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760839#comment-17760839
]
ASF GitHub Bot commented on PARQUET-2343:
-----------------------------------------
ConeyLiu opened a new pull request, #1136:
URL: https://github.com/apache/parquet-mr/pull/1136
Currently, the ParquetRewriter creates the `ColumnReadStoreImpl crStore` once
and reuses it for every block it rewrites. This is incorrect: we should create
a new `crStore` for each block that needs to be rewritten. Otherwise, the
rewrite fails as follows:
```java
java.lang.NullPointerException
    at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
    at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
    at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:735)
    at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
    at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:316)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
```
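To illustrate the failure mode, here is a minimal self-contained sketch. It does not use the real Parquet classes: `BlockPages`, `ReadStore`, and both `rewrite...` methods are hypothetical stand-ins, and the sketch assumes that a read store is bound to one block's pages when it is constructed. Under that assumption, a store created for the first block has nothing left to read once the loop moves on to the second block.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PerBlockReadStoreSketch {

    /** Hypothetical stand-in for the pages of one block (row group). Reading consumes them. */
    static final class BlockPages {
        private final Deque<String> pages;

        BlockPages(List<String> pages) {
            this.pages = new ArrayDeque<>(pages);
        }

        /** Returns the next page, or null once this block is exhausted. */
        String nextPage() {
            return pages.poll();
        }
    }

    /** Hypothetical stand-in for a read store: bound to one block's pages at construction. */
    static final class ReadStore {
        private final BlockPages source;

        ReadStore(BlockPages source) {
            this.source = source;
        }

        /** "Decodes" the next page; throws NullPointerException if no page is left. */
        String readPage() {
            String page = source.nextPage();
            return page.toUpperCase();
        }
    }

    /** Problematic pattern: the store is created once and reused for every block. */
    public static List<String> rewriteReusingStore(List<List<String>> blocks) {
        List<String> out = new ArrayList<>();
        ReadStore store = null;
        for (List<String> block : blocks) {
            BlockPages pages = new BlockPages(block);
            if (store == null) {
                store = new ReadStore(pages); // stays bound to the first block
            }
            for (int i = 0; i < block.size(); i++) {
                out.add(store.readPage());
            }
        }
        return out;
    }

    /** Pattern proposed in this PR: a fresh store for each block. */
    public static List<String> rewriteWithStorePerBlock(List<List<String>> blocks) {
        List<String> out = new ArrayList<>();
        for (List<String> block : blocks) {
            ReadStore store = new ReadStore(new BlockPages(block));
            for (int i = 0; i < block.size(); i++) {
                out.add(store.readPage());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = List.of(List.of("a", "b"), List.of("c"));
        System.out.println(rewriteWithStorePerBlock(blocks));
        try {
            rewriteReusingStore(blocks);
        } catch (NullPointerException e) {
            System.out.println("reused store failed on the second block");
        }
    }
}
```

With a single block both variants behave the same, which would be consistent with the problem only showing up for files that contain multiple row groups.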
### Jira
- [ ] My PR addresses the following [Parquet
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with
the [ASF 3rd Party License
Policy](https://www.apache.org/legal/resolved.html#category-x).
### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for
this extremely good reason:
### Commits
- [ ] My commits all reference Jira issues in their subject lines. In
addition, my commits follow the guidelines from "[How to write a good git
commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"
### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes
how to use it.
- All the public functions and the classes in the PR contain Javadoc that
explain what it does
> Fixes NPE when rewriting file with multiple rowgroups
> -----------------------------------------------------
>
> Key: PARQUET-2343
> URL: https://issues.apache.org/jira/browse/PARQUET-2343
> Project: Parquet
> Issue Type: Bug
> Reporter: Xianyang Liu
> Priority: Major
>
> Currently, the ParquetRewriter creates the `ColumnReadStoreImpl crStore` once
> and reuses it for every block it rewrites. This is incorrect: we should
> create a new `crStore` for each block that needs to be rewritten.
> Otherwise, the rewrite fails as follows:
> ```java
> java.lang.NullPointerException
>     at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
>     at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
>     at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:735)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
>     at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:316)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)