Re: Architecting the MS SQL CDC Processor

Matt Burgess Mon, 16 Oct 2017 19:59:59 -0700

Peter,

This is great to hear, I'm sure the community is looking forward to
such a solution!  I worked on the first offering of the
CaptureChangeMySQL processor, so here are some notes, comments, and
(hopefully!) answers to your questions:

* If you support a RecordSetWriter controller service as your output,
then you won't need JdbcCommon per se; instead you would create
Records and pass those to the user-selected RecordSetWriter. In that
sense you can support Avro, CSV, JSON, or anything else for which
there is a RecordSetWriter implementation.

* Depending on how often you'll be updating state, you may want to
implement something similar to the State Update Interval property in
CaptureChangeMySQL, which came about due to similar concerns about the
overhead of state updates vs the amount of processing beforehand.
This allows to user to tune the tradeoff, based on their own
requirements and performance and such.

* I have no concerns with having a different output format from
CaptureChangeMySQL; in fact the only reason it doesn't have a
RecordSetWriter output interface is that those capabilities were being
developed in parallel, so rather than have to wait for the
record-aware API stuff, I chose to output JSON. I have written
NIFI-4491 to improve/augment CDC processor(s) with RecordSetWriter
support. This would be very helpful by supporting various output
formats as well as generating the accompanying schema. If your
processor were the first to support this, it could be the exemplar for
past and future CDC processors :)

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4491

On Mon, Oct 16, 2017 at 10:36 PM, Peter Wicks (pwicks)
<[email protected]> wrote:
> I've been working on a new processor that does Change Data Capture with 
> Microsoft SQL Server. I followed Microsoft's documentation on how CDC works, 
> and I've got some code that gets me the changes and is testing well. Right 
> now, I don't actually have a processor, but a number of scripts that generate 
> SQL and I put it into ExecuteSQL and QueryDatabaseTable processors; with QDB 
> using my as-yet incomplete 
> NIFI-1706<https://github.com/apache/nifi/pull/2162>.
>
> One of the reasons I don't have a processor yet is because I don't want to 
> use the same output format as the MySQL CDC Processor, but I didn't want to 
> put in the time if it was not going to get merged. The MySQL CDC processor 
> uses JSON messages as the output format, but in MS SQL the CDC messages are 
> rows in a table; and it's much more convenient to output them as records. 
> Currently, I'm using Avro.
>
> Questions:
>
>   *   My output format doesn't have to be Avro, but given the source is rows 
> in a table being returned by a ResultSet, using the JdbcCommon class makes a 
> lot of sense to me. Can I move JdbcCommon to somewhere useful like 
> nifi-avro-record-utils?
>   *   I'll be looping through a list of tables and plan on committing the 
> files immediately to the success relationship as that table's CDC records are 
> pulled. I want to make sure that the max value tracking gets updated 
> immediately too. Does calling setState on the State Manager cause an 
> immediate state save? Is this safe to call repeatedly, assuming single 
> threaded, during the execution of the processor?
>   *   Concerns with using a different output format than the MySQL CDC 
> Processor?
>
> Thanks,
>   Peter

Re: Architecting the MS SQL CDC Processor

Reply via email to