Repository: apex-malhar
Updated Branches:
refs/heads/master baff632ae -> 12d6183cf
APEXMALHAR-2179: Add documentation for JDBC Poll Input Operator
Project: http://git-wip-us.apache.org/repos/asf/apex-malhar/repo
Commit: http://git-wip-us.apache.org/repos/asf/apex-malhar/commit/87a72434
Tree: http://git-wip-us.apache.org/repos/asf/apex-malhar/tree/87a72434
Diff: http://git-wip-us.apache.org/repos/asf/apex-malhar/diff/87a72434
Branch: refs/heads/master
Commit: 87a72434274c27532c8f38a71dfe8e51e85cc8db
Parents: 0a924ad
Author: Priyanka Gugale
Authored: Tue Aug 9 15:45:57 2016 +0530
Committer: Priyanka Gugale
Committed: Wed Sep 21 23:51:58 2016 +0530
--
.../images/jdbcinput/operatorsClassDiagram.png | Bin 0 -> 49841 bytes
docs/operators/jdbcPollInputOperator.md | 175 +++
2 files changed, 175 insertions(+)
--
http://git-wip-us.apache.org/repos/asf/apex-malhar/blob/87a72434/docs/operators/images/jdbcinput/operatorsClassDiagram.png
--
diff --git a/docs/operators/images/jdbcinput/operatorsClassDiagram.png
b/docs/operators/images/jdbcinput/operatorsClassDiagram.png
new file mode 100644
index 000..4b0432d
Binary files /dev/null and
b/docs/operators/images/jdbcinput/operatorsClassDiagram.png differ
http://git-wip-us.apache.org/repos/asf/apex-malhar/blob/87a72434/docs/operators/jdbcPollInputOperator.md
--
diff --git a/docs/operators/jdbcPollInputOperator.md
b/docs/operators/jdbcPollInputOperator.md
new file mode 100644
index 000..aa1d107
--- /dev/null
+++ b/docs/operators/jdbcPollInputOperator.md
@@ -0,0 +1,175 @@
+JDBC Poller Input Operator
+=
+
+## Operator Objective
+This operator scans a JDBC database table in parallel. It was added to address
two common input operator problems:
+
+1. As discussed in [Development Best
Practices](https://github.com/apache/apex-core/blob/master/docs/development_best_practices.md),
+the operator callbacks such as `beginWindow()`, `endWindow()`,
`emitTuples()`, etc.
+(which are invoked by the main operator thread)
+are required to return quickly, well within the default streaming window
duration of
+500ms. This requirement can be an issue when retrieving data from slow
external systems
+such as databases or object stores: if the call takes too long, the
platform will deem
+the operator blocked and restart it. Restarting will often run into the
same issue
+causing an unbroken sequence of restarts.
+
+2. When a large volume of data is available from a single store that allows
reading from
+ arbitrary locations (such as a file or a database table), reading the data
sequentially
+ can be throughput limiting: Having multiple readers read from
non-overlapping sections
+ of the store allows any downstream parallelism in the DAG to be exploited
better to
+ enhance throughput. For files, this approach is used by the file splitter
and block
+ reader operators in the Malhar library.
+
+The JDBC Poller Input operator addresses the first issue with an asynchronous
worker thread that retrieves the data and adds it to an in-memory queue; the
main operator thread dequeues tuples very quickly if data is available or simply
returns if not. The second issue is addressed in a way that parallels the
approach to files: multiple partitions read records from non-overlapping areas
of the table. Additional details of how this is done are described below.
+
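The worker-thread-plus-queue arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the Malhar implementation: the class `QueueBackedPoller`, its method names, and the `emitBatchSize` parameter are all hypothetical, and a plain `Iterable` stands in for the slow database read.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical stand-in for the operator; not the Malhar implementation. */
final class QueueBackedPoller<T> {
  private final BlockingQueue<T> queue;
  private final int emitBatchSize;
  private Thread worker;

  QueueBackedPoller(int queueCapacity, int emitBatchSize) {
    this.queue = new ArrayBlockingQueue<>(queueCapacity);
    this.emitBatchSize = emitBatchSize;
  }

  /** Starts the worker that performs the slow reads off the main thread. */
  void start(Iterable<T> slowSource) {
    worker = new Thread(() -> {
      try {
        for (T row : slowSource) {
          queue.put(row); // a full queue blocks the worker, never the main thread
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // asked to stop
      }
    }, "poller-worker");
    worker.setDaemon(true);
    worker.start();
  }

  /** Called on the main thread: emits what is ready and returns at once. */
  int emitTuples(Consumer<T> output) {
    int emitted = 0;
    T row;
    // poll() never blocks; it returns null when the queue is empty.
    while (emitted < emitBatchSize && (row = queue.poll()) != null) {
      output.accept(row);
      emitted++;
    }
    return emitted;
  }

  /** Interrupts the worker and waits for it to finish. */
  void stop() {
    if (worker == null) {
      return;
    }
    worker.interrupt();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

The bounded queue provides back-pressure on the worker, while the non-blocking `poll()` and the per-call batch limit keep each `emitTuples` call short regardless of how slow the source is.
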
+## Assumption
+The operator assumes that there is an ordered column on which range queries can
be formed. That is, the table has a column, or a combination of columns, with a
unique constraint, and every newly inserted record has a value in that column
greater than the current maximum, because the operator polls only appended records.
+
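To illustrate why an ordered, unique column matters, the sketch below splits a key space into non-overlapping half-open ranges and leaves the last range open-ended so that it also covers appended records. It is an assumption-laden illustration only: the class `KeyRanges`, the numeric key, and the generated SQL are hypothetical and do not show how the operator itself builds its queries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Malhar. Assumes a numeric, unique, increasing key. */
final class KeyRanges {

  /** Half-open key range [lower, upper); a null upper bound means "no upper limit". */
  static final class Range {
    final long lower;
    final Long upper;

    Range(long lower, Long upper) {
      this.lower = lower;
      this.upper = upper;
    }

    /** Builds an illustrative range query; table and column names come from the caller. */
    String toQuery(String table, String keyColumn) {
      String sql = "SELECT * FROM " + table + " WHERE " + keyColumn + " >= " + lower;
      if (upper != null) {
        sql += " AND " + keyColumn + " < " + upper;
      }
      return sql + " ORDER BY " + keyColumn;
    }
  }

  /** Splits [minKey, maxKey] into contiguous ranges; the last one is open-ended. */
  static List<Range> split(long minKey, long maxKey, int partitions) {
    if (partitions < 1 || maxKey < minKey) {
      throw new IllegalArgumentException("need partitions >= 1 and maxKey >= minKey");
    }
    List<Range> ranges = new ArrayList<>();
    long step = Math.max(1, (maxKey - minKey + 1) / partitions);
    long lower = minKey;
    for (int i = 0; i < partitions - 1; i++) {
      long upper = lower + step;
      ranges.add(new Range(lower, upper)); // upper is exclusive, so ranges never overlap
      lower = upper;
    }
    ranges.add(new Range(lower, null)); // open-ended: also sees newly appended rows
    return ranges;
  }
}
```

With keys 1 to 100 and four partitions, `split` yields `[1, 26)`, `[26, 51)`, `[51, 76)` and an open-ended range starting at 76. If new rows could receive keys below the current maximum, they would fall into a range that has already been scanned, which is why the assumption above is needed.
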
+## Use cases
+1. Scan huge database tables to either copy them to another database or process
them using **Apache Apex**. An example application that uses this operator to copy
database contents to HDFS is available in the [examples
repository](https://github.com/DataTorrent/examples/tree/master/tutorials/jdbcIngest).
Look for "PollJdbcToHDFSApp" for an example of this particular operator.
+
+## How to Use?
+The tuple type in the abstract class is a generic parameter. Concrete
subclasses need to choose an appropriate class (such as String or an
appropriate concrete Java class with a no-argument constructor, so that it can
be serialized using Kryo). They must also implement two abstract methods:
`getTuple(ResultSet)` to convert database rows to objects of the concrete class and
`emitTuple(T)` to emit the tuple.
+
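A minimal sketch of such a subclass is shown here. To keep it self-contained, `PollInputSketch` is a hypothetical stand-in that declares only the two abstract methods named above; it is not the Malhar base class, and the `User` tuple class, the column names `id` and `name`, and the `emitted` list (standing in for an output port) are likewise assumptions made for illustration.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the abstract operator; only the two methods named above. */
abstract class PollInputSketch<T> {
  abstract T getTuple(ResultSet row);

  abstract void emitTuple(T tuple);

  /** Converts one fetched row and emits it (illustrative glue, not Malhar code). */
  final void process(ResultSet row) {
    emitTuple(getTuple(row));
  }
}

/** Tuple class with a no-argument constructor, as needed for Kryo serialization. */
class User {
  int id;
  String name;

  User() {
  }
}

class UserPollInput extends PollInputSketch<User> {
  // Stands in for an output port; a real operator would emit on a port instead.
  final List<User> emitted = new ArrayList<>();

  @Override
  User getTuple(ResultSet row) {
    try {
      User user = new User();
      user.id = row.getInt("id");           // assumed column name
      user.name = row.getString("name");    // assumed column name
      return user;
    } catch (SQLException e) {
      throw new RuntimeException("could not convert row", e);
    }
  }

  @Override
  void emitTuple(User tuple) {
    emitted.add(tuple);
  }
}
```
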
+In principle, no ports need be defined in the rare case that the operator
simply