[
https://issues.apache.org/jira/browse/NIFI-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajmund Takacs updated NIFI-12843:
----------------------------------
Attachment: parquet_reader_usecases.json
> If record count is set, ParquetRecordReader does not read the whole file
> ------------------------------------------------------------------------
>
> Key: NIFI-12843
> URL: https://issues.apache.org/jira/browse/NIFI-12843
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.25.0, 2.0.0-M2
> Reporter: Rajmund Takacs
> Assignee: Rajmund Takacs
> Priority: Major
> Attachments: parquet_reader_usecases.json
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Earlier, ParquetRecordReader ignored the record.count attribute of the
> incoming FlowFile. With NIFI-12241 this was changed, and the reader now reads
> only the specified number of rows from the record set. If the Parquet file
> was not produced by a record writer, this attribute is normally not set, and
> the record reader reads the whole file. However, processors that produce a
> Parquet file by processing record sets may have this attribute set, referring
> to the record set the Parquet file was derived from rather than to the actual
> content. This leads to incorrect behavior.
> For example: ConsumeKafka produces a single-record FlowFile that is a
> Parquet file with 1000 rows; record.count is then set to 1 instead of 1000,
> because it refers to the Kafka record set. ParquetRecordReader therefore
> reads only the first record of the Parquet file.
> The sole reason for changing the reader to take record.count into account is
> that the CalculateParquetOffsets processors generate flow files with the same
> content but different offset and count attributes, each representing a slice
> of the original, large input. The Parquet reader then acts as if the large
> flow file contained only that slice, which makes processing more efficient.
> There is no need to support files that have no offset but do have a limit
> (count), so changing the reader to take record.count into account only when
> the offset attribute is also present would be a reasonable fix.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)