Hello Nikolay,

Thanks for the suggestion. It may well be a good feature; however, I do not see any significant value that it currently adds over the existing WAL Iterator. I think the following issues should be addressed, otherwise no regular user will be able to use CDC reliably:
- The interface exposes WALRecord, which is a private API.
- There is no way to start capturing changes from a certain point (a watermark for already processed data). Users can configure a large WAL archive size to sustain long node downtime for historical rebalance. If a CDC agent is restarted, it will have to start from scratch. I see that this is present in the IEP as a design choice, but I think it is a major usability issue.
- If a CDC reader does not keep up with the WAL write rate (e.g. there is a short-term write burst and the WAL archive is small), the Ignite node will delete WAL segments while the consumer is still reading them. Since the consumer runs out-of-process, we need to specify some sort of synchronization protocol between the node and the consumer.
- If an Ignite node crashes, is restarted and initiates a full rebalance, the consumer will lose some updates.
- Usually, it makes sense for the CDC consumer to read updates only on primary nodes (otherwise, multiple agents will be doing duplicate work). In the current design, the consumer will not be able to differentiate primary and backup updates. Moreover, even if we wrote such flags to the WAL, the consumer would need to process backup records anyway, because it is unknown whether the primary consumer is alive. In other words, how would an end user organize CDC failover while minimizing the duplicate work?

Wed, Oct 14, 2020 at 14:21, Nikolay Izhikov <[email protected]>:

> Hello, Igniters.
>
> I want to start a discussion of the new feature [1]
>
> CDC - capture data change. The feature allows the consumer to receive
> online notifications about data record changes.
>
> It can be used in the following scenarios:
>         * Export data into some warehouse, full-text search, or
> distributed log system.
>         * Online statistics and analytics.
>         * Wait and respond to some specific events or data changes.
>
> Propose to implement new IgniteCDC application as follows:
>         * Run on the server node host.
>         * Watches for the appearance of the WAL archive segments.
>         * Iterates it using existing WALIterator and notifies consumer of
> each record from the segment.
>
> IgniteCDC features:
>         * Independence from the server node process (JVM) - issues and
> failures of the consumer will not lead to server node instability.
>         * Notification guarantees and failover - i.e. CDC track and save
> the pointer to the last consumed record. Continue notification from this
> pointer in case of restart.
>         * Resilience for the consumer - it's not an issue when a consumer
> temporarily consumes slower than data appear.
>
> WDYT?
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-59+CDC+-+Capture+Data+Change
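
To make the watermark point from my reply concrete, here is a minimal sketch of the kind of restart marker I have in mind. It is only an illustration under my own assumptions: the class `CdcWatermark`, its fields, and the `segment:offset` file format are hypothetical and are not part of Ignite or of the IEP. It uses plain Java NIO and does not touch any Ignite API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical marker: position of the last WAL record the consumer has fully processed. */
final class CdcWatermark {
    /** Used when no watermark file exists yet: every record counts as unprocessed. */
    static final CdcWatermark START = new CdcWatermark(-1L, 0);

    final long segmentIdx; // index of the WAL segment (assumed to grow monotonically)
    final int offset;      // offset of the record inside that segment

    CdcWatermark(long segmentIdx, int offset) {
        this.segmentIdx = segmentIdx;
        this.offset = offset;
    }

    /** Returns true if a record at the given position lies strictly after the watermark. */
    boolean needsProcessing(long recSegmentIdx, int recOffset) {
        return recSegmentIdx > segmentIdx
            || (recSegmentIdx == segmentIdx && recOffset > offset);
    }

    /** Reads the watermark from disk, or returns START if the file does not exist. */
    static CdcWatermark load(Path file) throws IOException {
        if (!Files.exists(file))
            return START;

        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        String[] parts = text.split(":");

        return new CdcWatermark(Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
    }

    /** Writes to a sibling temp file first, then renames it over the old watermark. */
    void save(Path file) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");

        Files.write(tmp, (segmentIdx + ":" + offset).getBytes(StandardCharsets.UTF_8));

        // With ATOMIC_MOVE the rename is a single file-system operation where supported,
        // so a crash should leave either the old or the new watermark, not a partial file.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

After a restart the agent would call `CdcWatermark.load(...)` and skip every record for which `needsProcessing(...)` returns false. Saving only after the consumer has acknowledged a record gives at-least-once delivery, so the consumer still has to tolerate duplicates.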
