Rajmund Takacs created NIFI-12843:
-------------------------------------
Summary: If record count is set, ParquetRecordReader does not read
the whole file
Key: NIFI-12843
URL: https://issues.apache.org/jira/browse/NIFI-12843
Project: Apache NiFi
Issue Type: Bug
Components: Extensions
Affects Versions: 2.0.0-M2, 1.25.0
Reporter: Rajmund Takacs
Assignee: Rajmund Takacs
Earlier, ParquetRecordReader ignored the record.count attribute of the incoming
FlowFile. With NIFI-12241 this was changed, and the reader now reads only the
specified number of rows from the record set. If the Parquet file is not
produced by a record writer, this attribute is normally not set, and in that
case the reader reads the whole file. However, processors that produce a
Parquet file by processing a record set may have this attribute set, referring
to the record set the Parquet file was taken from rather than to the actual
content. This leads to incorrect behavior.
For example: if ConsumeKafka produces a single-record FlowFile whose content is
a Parquet file with 1000 rows, record.count is set to 1 instead of 1000,
because it refers to the Kafka record set. ParquetRecordReader then reads only
the first record of the Parquet file.
The sole reason for changing the reader to take record.count into account is
that the CalculateParquetOffsets processors generate FlowFiles with the same
content but different offset and count attributes, each representing a slice of
the original, large input. The Parquet reader then acts as if the large
FlowFile contained only that slice, which makes processing more efficient.
There is no need to support files that have no offset but do have a limit
(count), so changing the reader to take record.count into account only when the
offset attribute is also present could be a reasonable fix.
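The proposed guard could be sketched as below. This is a minimal illustration, not the actual NiFi code: the attribute names "record.count" and "record.offset" follow the ticket's description, but the class and method names here are hypothetical.

```java
import java.util.Map;

public class RecordCountGuard {
    /**
     * Returns the number of records the reader should emit, or
     * Long.MAX_VALUE (no limit) when record.count must be ignored.
     * The limit is honored only when record.offset is also present,
     * i.e. when the FlowFile is a slice produced by an offset-calculating
     * processor; a bare record.count (e.g. from ConsumeKafka) is ignored.
     */
    public static long effectiveLimit(Map<String, String> attributes) {
        final String count = attributes.get("record.count");
        final String offset = attributes.get("record.offset");
        if (count != null && offset != null) {
            return Long.parseLong(count);
        }
        return Long.MAX_VALUE; // no offset: read the whole file
    }
}
```

With this check, the ConsumeKafka case above (record.count=1, no offset) would fall through to reading all 1000 rows, while sliced FlowFiles carrying both attributes would keep the efficient limited read.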
--
This message was sent by Atlassian Jira
(v8.20.10#820010)