[ 
https://issues.apache.org/jira/browse/NIFI-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794399#comment-17794399
 ] 

Lars Francke commented on NIFI-6515:
------------------------------------

Thanks [~bbende], my analysis was probably flawed.

What we see is that the processor reads a file for ~20-30 minutes and then a 
single output FlowFile appears (in our case over 100GB in size). It may very 
well be that internally it reads a single record at a time, but as a user I 
only see the first output after a long delay.

I'd like to be able to process records as soon as the file has been opened. I 
understood this issue to be about that but - as you can see - my understanding 
of how the NiFi internals work is not the best (anymore) :)
I'm happy to withdraw my earlier comment and would then open a new issue when I 
have time to work on it.
I also had a brief conversation with [~joewitt] on Slack about this, where he 
basically recommended finding split points in a file before processing it:

??I do recommend finding out if it is possible to seek to a certain offset when 
pulling Parquet data.  If you can do that then what could be done is a component 
between the List* processor and FetchParquet which splits the thing to List 
with offsets.  Then FetchParquet can understand the offset and grab just the 
portion it is assigned.  That would be safe and more scalable.  I just don't 
know if the concept of offsets is real here.  If it is - game on??
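
To make the offsets idea a bit more concrete: Parquet files are organized into 
row groups, and the footer metadata records where each row group's column 
chunks are located, so a planning step could in principle turn one large file 
into several byte ranges. Below is a minimal Python sketch of just that 
planning step. It assumes the row-group offsets and sizes have already been 
read from the footer and are sorted by offset; RowGroup, Split and plan_splits 
are hypothetical names for illustration, not NiFi or Parquet APIs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical view of one row group as described by the file footer."""
    offset: int  # byte offset where the row group starts in the file
    length: int  # size of the row group in bytes


@dataclass(frozen=True)
class Split:
    """A byte range that a downstream fetch step could be assigned."""
    start: int
    length: int
    row_groups: int


def plan_splits(row_groups: Iterable[RowGroup], max_split_bytes: int) -> List[Split]:
    """Group consecutive row groups into splits of at most max_split_bytes.

    A row group is never cut in half: one that is larger than the limit
    becomes a split of its own. Input is assumed to be sorted by offset.
    """
    splits: List[Split] = []
    start = end = count = 0
    for rg in row_groups:
        rg_end = rg.offset + rg.length
        # Close the current split if adding this row group would exceed the limit.
        if count and rg_end - start > max_split_bytes:
            splits.append(Split(start, end - start, count))
            count = 0
        if count == 0:
            start = rg.offset
        end = rg_end
        count += 1
    if count:
        splits.append(Split(start, end - start, count))
    return splits
```

Each resulting Split could then travel as attributes on its own FlowFile, 
so that the fetch step only reads the row groups inside its range. Whether 
FetchParquet can actually be taught to honor such a range is the open question 
from the quote above.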

> FetchParquet max FlowFile size
> ------------------------------
>
>                 Key: NIFI-6515
>                 URL: https://issues.apache.org/jira/browse/NIFI-6515
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Matt Gilman
>            Priority: Major
>
> FetchParquet cannot transfer out multiple FlowFiles. We should introduce a 
> new property to set the size of the outgoing FlowFile and then transfer as 
> many FlowFiles as needed based on the fetched data.


