Ryan Blue created PARQUET-9:
-------------------------------
Summary: InternalParquetRecordReader will not read multiple blocks
when filtering
Key: PARQUET-9
URL: https://issues.apache.org/jira/browse/PARQUET-9
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Reporter: Ryan Blue
The InternalParquetRecordReader keeps a count of the records it has
processed and uses that count to decide when it is finished and when to load a
new row group of data. But when it is wrapping a FilteredRecordReader, this
count is not updated for records that are filtered out, so when the filtered
reader exhausts the block it is reading, the InternalParquetRecordReader keeps
calling read() on it and passes the resulting null values to the caller.
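To illustrate the accounting problem, here is a minimal, self-contained sketch. It is a simplified model and not the parquet-mr code: FilteredGroupReader, readAllBuggy, and the String records are hypothetical names made up for this example, and every row group is assumed to be non-empty. The only thing it shares with the description above is the idea that the count advances once per read() call while the filtered reader may consume several records per call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, simplified model of the bug; not the parquet-mr classes.
class FilteredCountBugSketch {

    // Stands in for a filtered record reader over ONE row group: read()
    // returns the next matching record, or null once the group is exhausted.
    static class FilteredGroupReader {
        private final List<String> records;
        private final Predicate<String> filter;
        private int pos = 0;

        FilteredGroupReader(List<String> records, Predicate<String> filter) {
            this.records = records;
            this.filter = filter;
        }

        String read() {
            while (pos < records.size()) {
                String r = records.get(pos++);
                if (filter.test(r)) {
                    return r;
                }
            }
            return null; // nothing left in this row group
        }
    }

    // Buggy accounting: `current` advances once per read() call, so records
    // skipped by the filter are never counted.
    static List<String> readAllBuggy(List<List<String>> rowGroups,
                                     Predicate<String> filter) {
        long total = 0;
        for (List<String> g : rowGroups) {
            total += g.size();
        }
        List<String> out = new ArrayList<>();
        long current = 0;      // records the outer reader believes it processed
        long totalLoaded = 0;  // records in all row groups loaded so far
        int nextGroup = 0;
        FilteredGroupReader reader = null;

        while (current < total) {
            if (current == totalLoaded) { // time to load the next row group
                List<String> g = rowGroups.get(nextGroup++);
                reader = new FilteredGroupReader(g, filter);
                totalLoaded += g.size();
            }
            out.add(reader.read()); // may be null: the group is already exhausted
            current++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("a1", "b1", "a2"),
                Arrays.asList("a3", "b2"));
        // Keep only records starting with "a".
        System.out.println(readAllBuggy(groups, s -> s.startsWith("a")));
        // prints [a1, a2, null, a3, null]
    }
}
```

In this model the caller sees a null after "a2", before the second row group has been loaded. A caller that treats null as end of input would stop there and never see "a3".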
The quick fix is to detect null values returned by the record reader and
advance the count so that the next row group is loaded. The longer-term
solution is to account for the filtered records correctly.
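The quick fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the pull request: FilteredGroupReader, readAllWithQuickFix, and the String records are hypothetical names, and every row group is assumed to be non-empty. The idea shown is only that a null from the filtered reader is treated as "this row group is done", so the count jumps to the end of the loaded data and the next row group is loaded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, simplified model of the quick fix; not the parquet-mr classes.
class QuickFixSketch {

    // Stands in for a filtered record reader over ONE row group: read()
    // returns the next matching record, or null once the group is exhausted.
    static class FilteredGroupReader {
        private final List<String> records;
        private final Predicate<String> filter;
        private int pos = 0;

        FilteredGroupReader(List<String> records, Predicate<String> filter) {
            this.records = records;
            this.filter = filter;
        }

        String read() {
            while (pos < records.size()) {
                String r = records.get(pos++);
                if (filter.test(r)) {
                    return r;
                }
            }
            return null; // nothing left in this row group
        }
    }

    static List<String> readAllWithQuickFix(List<List<String>> rowGroups,
                                            Predicate<String> filter) {
        long total = 0;
        for (List<String> g : rowGroups) {
            total += g.size();
        }
        List<String> out = new ArrayList<>();
        long current = 0;      // records accounted for so far
        long totalLoaded = 0;  // records in all row groups loaded so far
        int nextGroup = 0;
        FilteredGroupReader reader = null;

        while (current < total) {
            if (current == totalLoaded) { // time to load the next row group
                List<String> g = rowGroups.get(nextGroup++);
                reader = new FilteredGroupReader(g, filter);
                totalLoaded += g.size();
            }
            String record = reader.read();
            if (record == null) {
                // Row group exhausted: count its remaining (filtered) records
                // as processed so the next iteration loads the next group.
                current = totalLoaded;
                continue;
            }
            out.add(record);
            current++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("a1", "b1", "a2"),
                Arrays.asList("a3", "b2"));
        // Keep only records starting with "a".
        System.out.println(readAllWithQuickFix(groups, s -> s.startsWith("a")));
        // prints [a1, a2, a3]
    }
}
```

Note that this only patches the count when the end of a row group is detected. The count is still wrong while a row group is being read, which is why accounting for filtered records directly remains the longer-term solution.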
The pull request for the quick fix is
[#9|https://github.com/apache/incubator-parquet-mr/pull/9].