[
https://issues.apache.org/jira/browse/NIFI-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816118#comment-17816118
]
Lucas Read commented on NIFI-4385:
----------------------------------
Adding some thoughts here based on issues we have experienced with some
SQL-based processors over the past two weeks.
The default configuration for SQL-based processors (ExecuteSQL,
QueryDatabaseTable, PutDatabaseRecord) in Apache NiFi currently sets the
fetch/batch size to 0. With a fetch size of 0, the JDBC driver decides how many
rows to buffer at a time, and many drivers materialize the entire result set in
memory, which can cause significant JVM heap pressure when querying or
inserting large numbers of records. To mitigate this and improve performance,
I'd recommend adjusting the default to a more appropriate value, such as 100 or
1000.
*Suggested Changes:*
 * Update the default configuration for the following SQL-based processors (I'm
sure I'm missing some):
 ** *ExecuteSQL:* Set the default fetch/batch size to 100 or 1000.
 ** *QueryDatabaseTable:* Set the default fetch/batch size to 100 or 1000.
 ** *PutDatabaseRecord:* Set the default batch size to 100 or 1000.
*Rationale:* The current default fetch/batch size of 0 can lead to inefficient
memory usage and potential heap exhaustion, especially when dealing with large
datasets. A more suitable default, such as 100 or 1000, gives users better
out-of-the-box performance and a reduced risk of memory-related errors.
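Independent of NiFi internals, the heap effect of an unbounded versus a bounded
batch size can be sketched in plain Java. The names below are illustrative
stand-ins, not NiFi APIs; the point is only that a batch size of 0 means every
row is held in memory at once, while a positive batch size caps the peak:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

// Illustrative simulation (not NiFi code): compare the peak number of
// rows held in memory for batch size 0 vs. a bounded batch size.
public class BatchSizeDemo {

    // Drain the source, flushing the buffer whenever it reaches batchSize.
    // batchSize <= 0 means "no limit": everything is buffered before flush.
    static int peakBuffered(Iterator<Integer> rows, int batchSize) {
        List<Integer> buffer = new ArrayList<>();
        int peak = 0;
        while (rows.hasNext()) {
            buffer.add(rows.next());
            peak = Math.max(peak, buffer.size());
            if (batchSize > 0 && buffer.size() >= batchSize) {
                buffer.clear(); // hand the batch downstream
            }
        }
        return Math.max(peak, buffer.size());
    }

    public static void main(String[] args) {
        int totalRows = 10_000;
        // Batch size 0: all 10,000 rows buffered at once.
        System.out.println(peakBuffered(IntStream.range(0, totalRows).iterator(), 0));
        // Batch size 1000: never more than 1000 rows in memory.
        System.out.println(peakBuffered(IntStream.range(0, totalRows).iterator(), 1000));
    }
}
```

With a billion rows the first variant's buffer grows without bound, which is
exactly the heap behavior described above; the second variant's footprint stays
constant regardless of result-set size.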
*Impact Analysis:*
* *Performance:* Updating the default configuration is expected to improve
performance for users dealing with large datasets, as it will optimize memory
usage and reduce the likelihood of heap-related issues.
* *Compatibility:* Existing flows that explicitly set these properties are
unaffected. Flows relying on the current default of 0 would see results
delivered in batches rather than all at once, which should be the safer
behavior for most workloads.
> Adjust the QueryDatabaseTable processor for handling big tables.
> ----------------------------------------------------------------
>
> Key: NIFI-4385
> URL: https://issues.apache.org/jira/browse/NIFI-4385
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Affects Versions: 1.3.0
> Reporter: Tim Späth
> Priority: Major
>
> When querying large database tables, the *QueryDatabaseTable* processor does
> not perform well.
> The processor always executes the full query and then transfers all
> flowfiles as a single list, instead of transferring them incrementally as the
> *ResultSet* fetches the next rows (if a fetch size is given).
> If you want to query a billion rows from a table,
> the processor will add all flowfiles to an ArrayList<FlowFile> in memory
> before transferring the whole list after the last row is fetched by the
> ResultSet.
> I've checked the code in
> *org.apache.nifi.processors.standard.QueryDatabaseTable.java*
> and in my opinion, it would be no big deal to move the session.transfer to a
> proper position in the code (into the while loop where the flowfile is added
> to the list) to
> achieve real _streaming support_. There was also a bug report for this
> problem which resulted in adding the new property *Maximum Number of
> Fragments*, but this property just limits the results.
> Now you have to multiply *Max Rows Per Flow File* by *Maximum Number of
> Fragments* to get your limit,
> which is not really a solution to the original problem imho.
> Also, the workaround with GenerateTableFetch and/or ExecuteSQL processors is
> much slower than using a database cursor or a ResultSet
> and streaming the rows into flowfiles directly in the queue.
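The change the reporter describes (moving session.transfer into the fetch
loop) can be sketched in plain Java. The names here are hypothetical stand-ins,
not the actual QueryDatabaseTable source; the sketch only shows when downstream
would receive its first FlowFile under each pattern:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch (not NiFi source): compare how many rows must be
// fetched before downstream sees the first "FlowFile", depending on
// where the transfer happens.
public class TransferPlacementDemo {

    // Current pattern: every FlowFile is buffered in a list and the
    // transfer happens once, after the last row is fetched.
    static int rowsFetchedBeforeFirstTransferAtEnd(Iterator<String> rows) {
        List<String> flowFiles = new ArrayList<>();
        int fetched = 0;
        while (rows.hasNext()) {
            flowFiles.add(rows.next());
            fetched++;
        }
        // session.transfer(flowFiles) would happen here, after the loop
        return fetched; // downstream waits for the entire result set
    }

    // Proposed pattern: transfer inside the while loop, so the first
    // FlowFile reaches the queue right after the first fetched row.
    static int rowsFetchedBeforeFirstTransferInLoop(Iterator<String> rows) {
        int fetched = 0;
        int firstTransferAt = 0;
        while (rows.hasNext()) {
            rows.next();
            fetched++;
            // session.transfer(flowFile) would happen here, per row/batch
            if (firstTransferAt == 0) {
                firstTransferAt = fetched;
            }
        }
        return firstTransferAt;
    }
}
```

For a billion-row query the first pattern delivers nothing until the final row
is fetched (and holds every FlowFile in memory meanwhile), while the second
starts filling the downstream queue immediately and keeps a flat footprint.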
--
This message was sent by Atlassian Jira
(v8.20.10#820010)