[
https://issues.apache.org/jira/browse/NIFI-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajmund Takacs updated NIFI-12843:
----------------------------------
Attachment: parquet_reader_usecases.json
> If record count is set, ParquetRecordReader does not read the whole file
> ------------------------------------------------------------------------
>
> Key: NIFI-12843
> URL: https://issues.apache.org/jira/browse/NIFI-12843
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.25.0, 2.0.0-M2
> Reporter: Rajmund Takacs
> Assignee: Rajmund Takacs
> Priority: Major
> Attachments: parquet_reader_usecases.json
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Earlier, ParquetRecordReader ignored the record.count attribute of the
> incoming FlowFile. With NIFI-12241 this was changed, and the reader now reads
> only the specified number of rows from the record set. If the Parquet file
> was not produced by a record writer, this attribute is normally not set, and
> the record reader reads the whole file. However, processors that produce a
> Parquet file by processing record sets may have this attribute set, referring
> to the record set the Parquet file was derived from rather than to the actual
> content. This leads to incorrect behavior.
> For example: ConsumeKafka produces a single-record FlowFile that is a
> Parquet file with 1000 rows; record.count is then set to 1 instead of 1000,
> because it refers to the Kafka record set. ParquetRecordReader therefore
> reads only the first record of the Parquet file.
> The sole reason for changing the reader to take record.count into account is
> that the CalculateParquetOffsets processors generate flow files with the same
> content but different offset and count attributes, each representing a slice
> of the original, large input. The Parquet reader then acts as if the large
> flow file contained only that slice, which makes processing more efficient.
> There is no need to support files that have no offset but do have a limit
> (count), so changing the reader to take record.count into account only when
> the offset attribute is also present would be a reasonable fix.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)