Hi Gunnar,

Thank you very much for your help. I really appreciate it.
I believe you might be right. But still, Flink has its own connector to Oracle which does not require Debezium. I'm not an expert and don't have a wide view of how most customer sites work. My intention was just to show clearly what the current state of development is. I'm getting close to the point of deciding whether OpenLogReplicator should also have a commercial version with some enterprise features, or whether I should maybe reduce the time I spend on the project and just see what happens.

Regards,
Adam

> On 6 Jan 2023, at 13:18, Gunnar Morling <gunnar.morl...@googlemail.com.INVALID> wrote:
>
> Hey Adam, all,
>
> Just came across this thread, still remembering the good conversations we had around this while I was working on Debezium full-time :)
>
> Personally, I still believe the best way forward with this would be to add support to the Debezium connector for Oracle so it can ingest changes from a remote OpenLogReplicator instance via that server you've built. That way, you don't need to deal with any Kafka specifics, and users would inherit the existing functionality for backfilling, the integration with Debezium Server (e.g. for non-Kafka scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine (which is what Flink CDC is based on). The Debezium connector for Oracle is already built in a way that supports multiple stream ingestion adapters (currently LogMiner and XStream), so adding another one for OLR would be rather simple. This approach would simplify things from a Flink (CDC) perspective a lot.
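> Just to sketch the shape of the idea (purely hypothetical code - all names below are invented and this is not Debezium's actual SPI): the connector picks a streaming implementation at runtime, and an OLR-backed adapter would simply sit next to the LogMiner and XStream ones:
>
>     // Hypothetical adapter SPI - names invented for illustration only.
>     interface OracleStreamingAdapter {
>         // Stream change events, starting from the given position (SCN).
>         void streamChanges(long startScn, ChangeEventListener listener) throws Exception;
>     }
>
>     interface ChangeEventListener {
>         void onChange(byte[] payload);
>     }
>
>     // An OLR-backed adapter would connect to the remote OpenLogReplicator
>     // instance over its network protocol and forward decoded change events,
>     // just as the LogMiner and XStream adapters feed the connector today.
>     final class OpenLogReplicatorAdapter implements OracleStreamingAdapter {
>         private final String host;
>         private final int port;
>
>         OpenLogReplicatorAdapter(String host, int port) {
>             this.host = host;
>             this.port = port;
>         }
>
>         @Override
>         public void streamChanges(long startScn, ChangeEventListener listener) throws Exception {
>             // connect to host:port, request events from startScn,
>             // then loop and call listener.onChange(...) for each payload
>         }
>     }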
> I've just pinged the folks over in the Debezium community on this, it would be great to see progress in this matter.
>
> Best,
>
> --Gunnar
>
> On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>
>> Thanks Leonard, Jark,
>>
>> I will just reply on the dev list for this topic, as it is more related to development. Sorry, I sent to two lists - I don't want to add more chaos here.
>>
>> The answer to your question is not straightforward, so I will start with a broader picture.
>>
>> Maybe first I will describe some assumptions I made while designing OpenLogReplicator. The project is meant to be minimalistic. It should only contain the code necessary to parse Oracle redo logs - nothing more; it should not be a fully functional replicator. So the targets are limited to middleware (like Kafka, Flink, some MQ). The number of dependencies is kept to a minimum.
>>
>> The second assumption is to make the project stateless wherever possible. The goal is to run in HA (Kubernetes) and store state in Redis (not yet implemented). But generally OpenLogReplicator should not keep information (if possible) about the position of data confirmed by the receiver. This allows the receiver to choose how to handle failures (data duplicated on restart, idempotent messages).
>>
>> The third topic is the initial data load. There is plenty of software available for that, and there is absolutely no need to duplicate it in this project. No ETL, selects, etc. My goal is just to track changes.
>>
>> The fourth assumption is to write the code in C++ so that it is fast and I have full control over memory. The code can fully reuse memory and also work on machines with little memory. This allows easy compilation on Linux, but maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is demand for that). I think Java is good for some solutions, but not for a binary parser which works heavily with memory and in most cases uses a zero-copy approach.
>>
>> The amount of data in the output is actually defined by the source database (how much is logged - the full schema or just the changed columns). I don't care; the user defines what is logged by the database. If it is just the primary key and changed columns - I can send just the changed data. If someone wants the full schema in every payload - this is fine too. If the schema changes - no problem, I can provide just the DDL commands and process further payloads with the new schema.
>>
>> The format of the data is actually defined by the receiver. My first choice was JSON. Then the Debezium folks asked me to support Protobuf. OK, I spent a lot of time and extended the architecture to make the code modular and allow choosing the format of the payload. The writer module can directly produce JSON or Protobuf payloads, and that can be extended to any other format if there is demand for it. The JSON output also allows many formatting options. I generally don't test the Protobuf code - I would treat it as a prototype, because I don't know anybody who would like to use it. This code was planned for the Debezium request, but so far nobody cares.
>>
>> Integration with other systems and languages is an open case; I am agnostic here. The data produced for output is stored in a buffer and can be sent to any target. This is done by the Writer module (you can look at the code), and there is a writer for Kafka, ZeroMQ and even a plain network TCP/IP connection. I don't fully understand the question about adapting that better - if I have a specification, I can extend it. Just say what you need.
>>
>> In the case where we have a bidirectional connection (unlike with Kafka), the receiver can define the starting position (SCN) of the stream it wants to receive.
>>
>> You can look at the prototype code to see how this communication would look: StreamClient.cpp - but please treat it rather as a working prototype. It is a client which just connects to OpenLogReplicator over the network, defines the starting SCN and then receives the payload.
>>
>> In case:
>> - The connection is broken: the client reconnects, tells the server the last confirmed data and just asks for the following transactions.
>> - OpenLogReplicator crashes: after restart the client tells the server the last confirmed data and asks for the following transactions.
>> - The client crashes: the client needs to recover itself and ask for the transactions that come after the data it has already confirmed.
>>
>> I assume that if the client confirms that some SCN has been processed, OpenLogReplicator can remove it from the cache, and it is not possible that after a reconnect the client would demand data that it previously declared as confirmed.
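>> To make this flow concrete, a receiver-side loop could look roughly like the sketch below. This is only a hypothetical Java sketch: the host, port and the request/confirm message format are invented for illustration, and the real protocol is whatever StreamClient.cpp implements.
>>
>>     // Hypothetical receiver loop - names, port and wire format are invented.
>>     import java.io.BufferedReader;
>>     import java.io.InputStreamReader;
>>     import java.io.PrintWriter;
>>     import java.net.Socket;
>>     import java.nio.charset.StandardCharsets;
>>
>>     final class OlrClientSketch {
>>         public static void main(String[] args) throws Exception {
>>             long lastConfirmedScn = 0L; // recovered from the receiver's own state
>>             try (Socket socket = new Socket("olr-host", 9000); // hypothetical endpoint
>>                  PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
>>                  BufferedReader in = new BufferedReader(
>>                          new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
>>
>>                 // 1. Tell the server where to start: right after the last confirmed SCN.
>>                 out.println("{\"start-scn\": " + lastConfirmedScn + "}");
>>
>>                 // 2. Receive payloads, hand them downstream, and confirm the SCN so
>>                 //    OpenLogReplicator can drop the confirmed data from its cache.
>>                 String payload;
>>                 while ((payload = in.readLine()) != null) {
>>                     long scn = process(payload);
>>                     out.println("{\"confirm-scn\": " + scn + "}");
>>                     lastConfirmedScn = scn;
>>                 }
>>             }
>>             // After any crash or disconnect, the same loop runs again with the
>>             // last confirmed SCN, which gives at-least-once delivery.
>>         }
>>
>>         private static long process(String payload) {
>>             // parse the payload, pass the change event downstream and return its SCN
>>             return 0L;
>>         }
>>     }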
>> Well, this is what is currently done. Some of the code was driven by requests from the Debezium team with future integration in mind, like support for Protobuf or putting some extra data into the payload - but it was never used.
>> We have opened a ticket in their Jira for the integration:
>> https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
>> But there is no progress and no feedback on whether they want to do the integration or not. I have made some effort to allow easier integration, but I'm not going to write Kafka Connect code for OpenLogReplicator. I just don't have the resources for that. I think they are focused on their own approach with LogMiner, waiting for OpenLogReplicator to become more mature before any integration is done. If you want the Flink integration to depend on the Debezium integration, it may never happen.
>>
>> Recently I was focused mostly on making the code stable and releasing version 1.0, and I have reached that point. I am not aware of any problems with the code that is currently working. The code is meant to be modular and allow easy integration, but as you mentioned, there is no SDK. Actually, this is the topic I would like to talk about. Is there a reason for an SDK? Would someone find it useful? Maybe plain Kafka is enough. Maybe it would be best if someone took the code and rewrote it in Java? But definitely not me - I would find that nonsense; the Java code would suffer.
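>> If an SDK did make sense, I imagine its whole surface could be very small - something roughly like the sketch below. These are purely hypothetical names, nothing like this exists today; it is only meant to anchor the discussion:
>>
>>     // Hypothetical Java SDK surface - these types do not exist, they only
>>     // illustrate how thin an integration point could be.
>>     interface OpenLogReplicatorClient extends AutoCloseable {
>>         // Connect to a running OpenLogReplicator instance and start streaming
>>         // from the given SCN.
>>         void connect(String host, int port, long startScn) throws Exception;
>>
>>         // Block until the next change payload (JSON or Protobuf bytes) is available.
>>         byte[] poll() throws Exception;
>>
>>         // Confirm that everything up to this SCN has been durably handled,
>>         // so the server can drop it from its cache.
>>         void confirm(long scn) throws Exception;
>>     }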
>> What kind of interface would be best for Flink? OpenLogReplicator produces the payload in Protobuf or JSON. If you want to use, for example, XML, it would be a waste to write a converter; I would definitely prefer to add another writer module that just produces XML instead. If you need a certain format - this is no problem.
>>
>> But if you want a full initial data load (snapshot) - this can't be done, because this project is not for that. You have your own good code for it.
>>
>> In practice I think there would be just a few projects that could be receivers of data from OpenLogReplicator, so there is no reason to write a generic SDK for everybody.
>>
>> My goal was just to start a conversation - to discuss whether such an integration really makes sense or not. I really prefer a simple architecture, with as few data conversions as necessary - not one where I produce some format and you convert it anyway. This way replication from Oracle can be really fast.
>>
>> I'm just about to begin writing tutorials for OpenLogReplicator, and the documentation is out of date. I have a regular day job which I need to pay the rent, and a family, and I work on this project only in my free time, so progress is slow. I don't expect that to change in the future. But in spite of that, I know companies that already use the code in production, and it works fast and stably. From the clients' perspective it works 10 times faster than LogMiner - but this depends on the actual case; you would need to run a benchmark and test it yourself.
>>
>> Regards,
>> Adam
>>
>>> On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
>>>
>>> Hi Adam & Márton,
>>>
>>> Thanks for bringing the discussion here.
>>>
>>> The Flink CDC project provides the Oracle CDC connector, which can be used to capture historical and transaction-log data from an Oracle database and ingest it into Flink. In the latest version, 2.3, the Oracle CDC connector already supports the parallel incremental snapshot algorithm, which allows parallel reading of historical data and lock-free switching from historical reading to transaction-log reading. In the transaction-log capturing phase, the connector uses Debezium as the library, which supports the LogMiner and XStream APIs to capture change data. IIUC, OpenLogReplicator could be used as a third way.
>>>
>>> For integrating OpenLogReplicator, there are several interesting points that we can discuss further:
>>> (1) Flink CDC connectors do not rely on Kafka or other message-queue storage; data is processed directly after capture. I think the network-stream approach of OpenLogReplicator would need to be adapted for this.
>>> (2) The Flink CDC project, like Flink itself, is mainly developed in Java. Does OpenLogReplicator provide a Java SDK for easy integration?
>>> (3) If OpenLogReplicator is planned to be integrated into the Debezium project first, the Flink CDC project could easily integrate OpenLogReplicator by bumping the Debezium version.
>>>
>>> Best,
>>> Leonard
>>
>>> On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
>>>
>>> Hi Adam,
>>>
>>> Thanks for sharing this interesting project. I think it is definitely valuable for users who want better speed.
>>>
>>> I am one of the maintainers of the flink-cdc-connectors project. The project offers an "oracle-cdc" connector which uses Debezium (which depends on LogMiner) as the CDC library. From the perspective of the "oracle-cdc" connector, I have some questions about OpenLogReplicator:
>>>
>>> 1) Can OpenLogReplicator provide a Java SDK to allow Flink to communicate with the Oracle server directly, without deploying any other service?
>>> 2) How much overhead does it put on Oracle compared to the LogMiner approach?
>>> 3) Did you discuss this with the Debezium community? I think Debezium might be interested in this project as well.
>>>
>>> Best,
>>> Jark
>>>
>>>> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>
>>>> Hi Márton,
>>>>
>>>> Thank you very much for your answer.
>>>>
>>>> The point about Kafka makes sense. It offers a huge bag of potential connectors that could be used. But … not everybody wants or needs Kafka. It brings additional architectural complication and delays, which might not be acceptable to everybody. That's why you have your own connectors anyway.
>>>>
>>>> The Flink connector which reads from Oracle uses the LogMiner technology, which is not acceptable for every user. It has big limitations regarding speed. You can overcome that only with a binary reader of the database redo logs (roughly 10 times faster, with latency even down to 50-100 ms).
>>>>
>>>> The reason I am asking is not just to create some additional connector for fun. My main concern is whether there is actual demand from users for getting changes from the source database faster, or with lower latency. You can find a lot of information on the net about the differences between a binary log-based reader and one using the LogMiner technology.
>>>>
>>>> I think that would be enough for a start. Please tell me what you think about it. Would anyone consider using such a connector?
>>>>
>>>> Regards,
>>>> Adam Leszczyński
>>>>
>>>>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>
>>>>> (cc Leonard)
>>>>>
>>>>> Hi Adam,
>>>>>
>>>>> From an architectural perspective, if you land the records in Kafka or another message broker, Flink will be able to process them; at this point I do not see much merit in trying to circumvent this step. There is a related project in the Flink space called CDC Connectors [1] - I highly encourage you to check that out for context, and I have cc'd Leonard, one of its primary maintainers.
>>>>>
>>>>> [1] https://github.com/ververica/flink-cdc-connectors/
>>>>>
>>>>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>>>
>>>>>> Hi Flink Team,
>>>>>>
>>>>>> I'm the author of OpenLogReplicator - an open-source parser of Oracle redo logs which allows sending transactions to a message bus. Currently the implemented sinks are just a text file or a Kafka topic. Transactions can also be sent over a plain TCP connection or a message queue like ZeroMQ. The code is GPL and all Oracle versions from 11.2 are supported. No LogMiner is needed.
>>>>>>
>>>>>> Transactions can be sent in JSON or Protobuf format. The code has reached GA and is actually used in production.
>>>>>>
>>>>>> The architecture is modular and makes it very easy to add other sinks, like for example Apache Flink. Actually, I'm moving towards an approach where OpenLogReplicator could use Kubernetes and work in HA.
>>>>>>
>>>>>> Well… that is the general direction. Do you think there could be some application of this software with Apache Flink? For example, there could very easily be a client which connects to OpenLogReplicator over a TCP connection, gets the transactions and just sends them to Apache Flink. An example of such a client is also present in the GitHub repo:
>>>>>> https://github.com/bersler/OpenLogReplicator
>>>>>>
>>>>>> Is there any rationale for such an integration? Or is it just a waste of time because nobody would use it anyway?
>>>>>>
>>>>>> Kind regards,
>>>>>> Adam Leszczyński