takraj commented on PR #7893: URL: https://github.com/apache/nifi/pull/7893#issuecomment-1772142723
@pvillard31
* I guess creating the 50 clones actually copies the input content 50 times. You could also give it a try with the "Zero Content Output" setup.
* To determine how many records are in the Parquet file, CalculateParquetOffsets needs to read some parts of the file, using the same library as ParquetReader. I'm not sure how this library determines the record count; it may scan the whole file and count the records. However, if your input FlowFile has a `record.count` attribute, this step is skipped. Give it a try; I'd expect the whole process to be faster.
* ParquetReader uses a RecordFilter to 'jump' to the wanted record. But since this is a general concept in the library we use, the whole file is still scanned through, and the RecordFilter is evaluated against each of the records. Unfortunately, there is no concept like 'jump to a record index'. I do have an optimization idea, though: stop reading records after we have read the desired number of them. The RecordFilter would then only be used to find the first record to be read. I'll add a new commit with this improvement soon.
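
The optimization described in the last bullet can be sketched as a plain filter over a record iterator. This is a simplification standing in for Parquet's RecordFilter mechanism; the class and method names (`OffsetWindowReader`, `readWindow`) are illustrative and not part of NiFi's or parquet-mr's API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the proposed behavior: a filter skips records until the first
// wanted index (offset), and reading stops as soon as the desired number of
// records (count) has been collected, instead of evaluating the filter
// against every remaining record in the file.
public class OffsetWindowReader {

    static <T> List<T> readWindow(Iterator<T> records, long offset, long count) {
        List<T> result = new ArrayList<>();
        long index = 0;
        // Stop-early condition: quit the loop once 'count' records are read.
        while (records.hasNext() && result.size() < count) {
            T record = records.next();
            // Filter condition: skip everything before the first wanted record.
            if (index++ >= offset) {
                result.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> all = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        // Read 3 records starting at index 4; records 7..9 are never visited.
        System.out.println(readWindow(all.iterator(), 4, 3)); // prints [4, 5, 6]
    }
}
```

Without the stop-early condition, the loop would keep evaluating the filter until the iterator is exhausted, which matches the current full-scan behavior the comment describes.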
