Re: [PR] NIFI-12241 Efficient Parquet Splitting [nifi]

via GitHub Mon, 06 Nov 2023 01:10:01 -0800


takraj commented on PR #7893:
URL: https://github.com/apache/nifi/pull/7893#issuecomment-1794369725


   @pvillard31 Did you try the new `CalculateParquetRowGroupOffsets` processor 
too? With this one you cannot configure the number of records per split, because 
it splits on row group boundaries. But since it calculates where the 
`ParquetReader` should seek in the file, the data is distributed and processed 
much more efficiently. My measurements showed a major performance improvement.
   
   You can also combine the two as `CalculateParquetRowGroupOffsets -> 
CalculateParquetOffsets`, so you still control how many records end up in a 
single FlowFile.
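
   To illustrate the combined flow, here is a minimal Python sketch. It is not 
the processors' actual code: the function names are hypothetical, the row group 
sizes are made-up inputs, and it tracks record offsets only, whereas the real 
processors also work out where the reader should seek in the file.

```python
from typing import List, Tuple

# A range is (record_offset, record_count). All names here are hypothetical
# and only model the idea; they are not NiFi or Parquet APIs.
Range = Tuple[int, int]


def row_group_ranges(row_group_sizes: List[int]) -> List[Range]:
    """Step 1: emit one range per row group (split on row group boundaries)."""
    ranges = []
    offset = 0
    for size in row_group_sizes:
        ranges.append((offset, size))
        offset += size
    return ranges


def split_range(offset: int, count: int, records_per_split: int) -> List[Range]:
    """Step 2: cut one range into pieces of at most records_per_split records."""
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")
    pieces = []
    end = offset + count
    while offset < end:
        size = min(records_per_split, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


def combined_splits(row_group_sizes: List[int], records_per_split: int) -> List[Range]:
    """Chain both steps: no resulting piece crosses a row group boundary."""
    return [
        piece
        for offset, count in row_group_ranges(row_group_sizes)
        for piece in split_range(offset, count, records_per_split)
    ]


if __name__ == "__main__":
    # Two row groups of 250 and 100 records, at most 100 records per piece.
    print(combined_splits([250, 100], 100))
    # [(0, 100), (100, 100), (200, 50), (250, 100)]
```

   Note that the last piece of a row group can be smaller than the configured 
size (50 records above), since pieces never span two row groups.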
   
   I can't really improve this any further, because of API limitations of 
the underlying library.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
