takraj commented on PR #7893: URL: https://github.com/apache/nifi/pull/7893#issuecomment-1772142723
@pvillard31
* I guess creating the 50 clones actually copies the input content 50 times. You could also give it a try with the "Zero Content Output" setup.
* To determine how many records are in the Parquet file, CalculateParquetOffsets needs to read some parts of the file, using the same library as ParquetReader. I'm not sure how this library determines the record count; it may scan the whole file and count the records. However, if your input FlowFile has a `record.count` attribute, this step is skipped. Give it a try; I'd expect the whole process to be faster.
* ParquetReader uses a RecordFilter to 'jump' to the wanted record. But since this is a general concept in the library we use, the whole file is still scanned through, and the RecordFilter is evaluated against each of the records. Unfortunately, there is no concept like 'jump to a record index'. I do have an optimization idea, though: stop reading records after we have read the desired number of them. The RecordFilter would then only be used to find the first record to be read. I'll add a new commit with this improvement soon.
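
The optimization described in the last bullet can be sketched as a plain filter over a record iterator. This is a simplification standing in for Parquet's RecordFilter mechanism; the class and method names (`OffsetWindowReader`, `readWindow`) are illustrative and not part of NiFi's or parquet-mr's API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the proposed behavior: a filter skips records until the first
// wanted index (offset), and reading stops as soon as the desired number of
// records (count) has been collected, instead of evaluating the filter
// against every remaining record in the file.
public class OffsetWindowReader {

    static <T> List<T> readWindow(Iterator<T> records, long offset, long count) {
        List<T> result = new ArrayList<>();
        long index = 0;
        // Stop-early condition: quit the loop once 'count' records are read.
        while (records.hasNext() && result.size() < count) {
            T record = records.next();
            // Filter condition: skip everything before the first wanted record.
            if (index++ >= offset) {
                result.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> all = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        // Read 3 records starting at index 4; records 7..9 are never visited.
        System.out.println(readWindow(all.iterator(), 4, 3)); // prints [4, 5, 6]
    }
}
```

Without the stop-early condition, the loop would keep evaluating the filter until the iterator is exhausted, which matches the current full-scan behavior the comment describes.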
