[
https://issues.apache.org/jira/browse/NIFI-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351214#comment-15351214
]
Matt Burgess commented on NIFI-2126:
------------------------------------
A proposed solution is to add two processors:
1) ListDatabaseTables: This processor would use a DatabaseConnectionPool
controller service and call getTables() to list the tables. If the optional
"Include Row Count" property (default false) is set, it would also issue a
"SELECT COUNT(1) FROM table" query against the database. The table name (and
its row count, if requested) would be included as attributes in a zero-content
flow file.
2) GenerateTableFetch: This processor would presumably have a
ListDatabaseTables in front of it and would use the same DatabaseConnectionPool
service. It would read the aforementioned attributes along with an optional
"Partition Size" property (which accepts Expression Language), and use that
information to generate flow files containing SQL statements that select rows
from a table. If a partition size is set, each SELECT statement would refer to
a range of rows, so that it fetches only a portion of the table. These flow
files (due to NIFI-1973) can be passed to ExecuteSQL processors for the actual
fetching of rows.
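As a rough illustration of the partitioning idea in (2), the sketch below
turns a table name, a row count, and a partition size into one SELECT statement
per page. It is not the proposed processor code: the class name, the method
name, the orderColumn parameter, and the LIMIT/OFFSET paging syntax (which
MySQL and PostgreSQL accept, but other databases may not) are all assumptions
made for the example. An ORDER BY is included because paging is not
deterministic without a stable row order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NiFi: shows how paged SELECTs could be built.
class TableFetchSketch {

    /**
     * Builds one SELECT per page of the table.
     * If rowCount or partitionSize is not positive (count unknown or no
     * partitioning requested), a single statement covering the whole table
     * is returned instead.
     */
    static List<String> generateFetchStatements(String tableName, String orderColumn,
                                                long rowCount, long partitionSize) {
        List<String> statements = new ArrayList<>();
        String base = "SELECT * FROM " + tableName + " ORDER BY " + orderColumn;
        if (rowCount <= 0 || partitionSize <= 0) {
            statements.add(base);
            return statements;
        }
        // Assumed dialect: LIMIT/OFFSET paging; adjust for the target database.
        for (long offset = 0; offset < rowCount; offset += partitionSize) {
            statements.add(base + " LIMIT " + partitionSize + " OFFSET " + offset);
        }
        return statements;
    }

    public static void main(String[] args) {
        // 25 rows with a partition size of 10 yields three statements.
        for (String sql : generateFetchStatements("users", "id", 25, 10)) {
            System.out.println(sql);
        }
    }
}
```

In a flow, each generated statement would become the content of its own flow
file, so that separate ExecuteSQL instances can each fetch one page.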
This offers a generally useful processor for database table metadata
(ListDatabaseTables), as well as a distributable solution for fetching sections
of database tables, to be used for massive data migration, etc.
> Add processors to enable distributed fetching of database tables
> ----------------------------------------------------------------
>
> Key: NIFI-2126
> URL: https://issues.apache.org/jira/browse/NIFI-2126
> Project: Apache NiFi
> Issue Type: New Feature
> Reporter: Matt Burgess
> Assignee: Matt Burgess
> Fix For: 1.0.0
>
>
> To enable NiFi to migrate/move data from RDBMS source tables to other target
> systems (other RDBMS, HDFS, etc.), one approach is to be able to distribute
> the fetching of large tables across various tasks/nodes, rather than a single
> ExecuteSQL processor (which for large tables can run out of memory and get
> slow).
> The idea would be to generate flow files containing SQL statements that would
> fetch a portion (or "page") of a table. These flow files can be distributed
> in NiFi to many ExecuteSQL processors, each of which would grab a page and
> emit the results. The flow(s) can then continue in parallel/distributed
> fashion until the data is in the target location(s).