[ 
https://issues.apache.org/jira/browse/NIFI-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819430#comment-17819430
 ] 

Matt Burgess commented on NIFI-4385:
------------------------------------

I'm OK with changing the Fetch Size default, but 0 leaves it up to the driver, 
and drivers' defaults are often WAY too low. It would be nice to add more 
documentation around the choice of default as well as what it means to the 
user; I've sometimes seen the assumption that Fetch Size determines how many 
rows end up in a FlowFile, when there's a separate property for that.
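To illustrate that distinction (with hypothetical numbers, not NiFi defaults): Fetch Size controls how many rows the driver pulls per network round trip, while Max Rows Per Flow File controls how many rows land in each output FlowFile, so the two are independent. A minimal sketch:

```java
// Hypothetical helper, not NiFi code: shows that Fetch Size and
// Max Rows Per Flow File answer two different questions.
public class FetchSizeMath {

    // How many network round trips to the database the driver makes.
    public static long roundTrips(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize;
    }

    // How many FlowFiles the processor emits.
    public static long flowFiles(long totalRows, int maxRowsPerFlowFile) {
        return (totalRows + maxRowsPerFlowFile - 1) / maxRowsPerFlowFile;
    }
}
```

With 1,000,000 rows, a fetch size of 10,000 and Max Rows Per Flow File of 50,000, you get 100 round trips but only 20 FlowFiles.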

Another thing I'm looking into is whether I can speed everything up by doing 
the fetch in parallel (on multiple cores): fetching the specified number of 
rows on one thread while another thread writes them to the FlowFile. It always 
depends on the use case, but sometimes the Avro conversion of the ResultSet is 
"the long pole in the tent", so rather than having that logic wait on a fetch, 
it can work constantly while at least one separate thread ensures there's 
always data ready to be converted and written out. I'll write up a separate 
Jira for that once I fully characterize the issue and the proposed solution. 
In the meantime I welcome all thoughts, comments, questions, and concerns 
right here :)
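A minimal sketch of that pipelined approach, using a stdlib BlockingQueue with a fetcher thread standing in for the JDBC fetch loop and the caller standing in for the Avro conversion (class and method names are hypothetical, not NiFi code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedFetch {
    // Sentinel signaling end-of-stream from producer to consumer.
    private static final List<Integer> POISON = new ArrayList<>();

    public static long run(int totalRows, int batchSize) throws InterruptedException {
        // Bounded queue keeps a few batches ready without unbounded memory growth.
        BlockingQueue<List<Integer>> queue = new ArrayBlockingQueue<>(4);

        // Producer: stands in for the JDBC fetch loop, pulling batchSize rows per pass.
        Thread fetcher = new Thread(() -> {
            try {
                for (int start = 0; start < totalRows; start += batchSize) {
                    List<Integer> batch = new ArrayList<>();
                    for (int r = start; r < Math.min(start + batchSize, totalRows); r++) {
                        batch.add(r);
                    }
                    queue.put(batch);
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        // Consumer: stands in for Avro conversion + writing to the FlowFile.
        // It never waits on the database directly, only on the queue.
        long written = 0;
        while (true) {
            List<Integer> batch = queue.take();
            if (batch == POISON) break;
            written += batch.size();
        }
        fetcher.join();
        return written;
    }
}
```

The bounded queue is what lets the conversion side "work constantly": the fetcher stays at most a few batches ahead, so memory stays flat while neither side idles.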

> Adjust the QueryDatabaseTable processor for handling big tables.
> ----------------------------------------------------------------
>
>                 Key: NIFI-4385
>                 URL: https://issues.apache.org/jira/browse/NIFI-4385
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.3.0
>            Reporter: Tim Späth
>            Priority: Major
>
> When querying large database tables, the *QueryDatabaseTable* processor does 
> not perform very well.
> The processor always performs the full query and then transfers all 
> flowfiles as a single list, instead of transferring them individually as the 
> *ResultSet* fetches the next rows (if a fetch size is given). 
> If you want to query a billion rows from a table, 
> the processor will add all flowfiles to an ArrayList<FlowFile> in memory 
> before transferring the whole list after the last row is fetched by the 
> ResultSet. 
> I've checked the code in 
> *org.apache.nifi.processors.standard.QueryDatabaseTable.java*, 
> and in my opinion it would be no big deal to move the session.transfer to a 
> proper position in the code (into the while loop where the flowfile is added 
> to the list) to 
> achieve real _stream support_. There was also a bug report for this problem 
> which resulted in adding the new property *Maximum Number of Fragments*, 
> but that property just limits the results. 
> Now you have to multiply *Max Rows Per Flow File* by *Maximum Number of 
> Fragments* to get your limit, 
> which is not really a solution for the original problem imho. 
> Also, the workaround with the GenerateTableFetch and/or ExecuteSQL processors 
> is much slower than using a database cursor or a ResultSet 
> and streaming the rows into flowfiles directly in the queue.
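The buffering-vs-streaming behavior the issue describes can be sketched with plain collections (hypothetical names; row ids stand in for FlowFiles, and the NiFi session calls appear only as comments):

```java
import java.util.ArrayList;
import java.util.List;

public class TransferPlacement {

    // Buffered variant: mirrors the behavior the issue reports -- every batch
    // is collected in a list and transferred only after the last row is
    // fetched, so peak memory grows with the full result set.
    public static int bufferedPeak(int totalRows, int batchSize) {
        List<List<Integer>> pending = new ArrayList<>();
        int peak = 0;
        for (int start = 0; start < totalRows; start += batchSize) {
            pending.add(rows(start, Math.min(start + batchSize, totalRows)));
            peak = Math.max(peak, pending.stream().mapToInt(List::size).sum());
        }
        pending.clear(); // the single transfer of the whole list at the end
        return peak;
    }

    // Streaming variant: transfer each batch inside the fetch loop, as the
    // issue proposes, so peak memory is bounded by one batch.
    public static int streamingPeak(int totalRows, int batchSize) {
        int peak = 0;
        for (int start = 0; start < totalRows; start += batchSize) {
            List<Integer> batch = rows(start, Math.min(start + batchSize, totalRows));
            peak = Math.max(peak, batch.size());
            // session.transfer(...) would happen here, releasing the batch
        }
        return peak;
    }

    private static List<Integer> rows(int from, int to) {
        List<Integer> batch = new ArrayList<>();
        for (int r = from; r < to; r++) batch.add(r);
        return batch;
    }
}
```

For a billion-row table the difference between the two peaks is the difference between bounded memory and an OutOfMemoryError.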



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
