Hi Gunnar,

Thank you very much for your help. I really appreciate it.
I believe you might be right. But still, Flink has its own connector to Oracle which does not require Debezium. I'm not an expert and don't have a wide view of how most customer sites work. My intention was just to show clearly what the current state of development is. I'm getting close to the point of deciding whether OpenLogReplicator should also have a commercial version with some enterprise features, or whether I should maybe reduce the time I spend on the project and just see what happens.

Regards,
Adam

> On 6 Jan 2023, at 13:18, Gunnar Morling <gunnar.morl...@googlemail.com.INVALID> wrote:
>
> Hey Adam, all,
>
> Just came across this thread, still remembering the good conversations we had around this while I was working on Debezium full-time :)
>
> Personally, I still believe the best way forward with this would be to add support to the Debezium connector for Oracle so it can ingest changes from a remote OpenLogReplicator instance via that server you've built. That way, you don't need to deal with any Kafka specifics, and users would inherit the existing functionality for backfilling, the integration with Debezium Server (e.g. for non-Kafka scenarios like Apache Pulsar, Kinesis, etc.), and the Debezium engine (which is what Flink CDC is based on). The Debezium connector for Oracle is already built in a way that supports multiple stream ingestion adapters (currently LogMiner and XStream), so adding another one for OLR would be rather simple. This approach would simplify things from a Flink (CDC) perspective a lot.
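> Just to sketch the shape of the idea (purely hypothetical code - all names below are invented and this is not Debezium's actual SPI): the connector picks a streaming implementation at runtime, and an OLR-backed adapter would simply sit next to the LogMiner and XStream ones:
>
>     // Hypothetical adapter SPI - names invented for illustration only.
>     interface OracleStreamingAdapter {
>         // Stream change events, starting from the given position (SCN).
>         void streamChanges(long startScn, ChangeEventListener listener) throws Exception;
>     }
>
>     interface ChangeEventListener {
>         void onChange(byte[] payload);
>     }
>
>     // An OLR-backed adapter would connect to the remote OpenLogReplicator
>     // instance over its network protocol and forward decoded change events,
>     // just as the LogMiner and XStream adapters feed the connector today.
>     final class OpenLogReplicatorAdapter implements OracleStreamingAdapter {
>         private final String host;
>         private final int port;
>
>         OpenLogReplicatorAdapter(String host, int port) {
>             this.host = host;
>             this.port = port;
>         }
>
>         @Override
>         public void streamChanges(long startScn, ChangeEventListener listener) throws Exception {
>             // connect to host:port, request events from startScn,
>             // then loop and call listener.onChange(...) for each payload
>         }
>     }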
> I've just pinged the folks over in the Debezium community on this, it would be great to see progress in this matter.
>
> Best,
>
> --Gunnar
>
> On Thu, 5 Jan 2023 at 20:55, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>
>> Thanks Leonard, Jark,
>>
>> I will just reply on the dev list for this topic, as it is more related to development. Sorry, I sent to two lists - I don't want to add more chaos here.
>>
>> The answer to your question is not straightforward, so I will start with a broader picture.
>>
>> Maybe first I will describe some assumptions I made while designing OpenLogReplicator. The project is meant to be minimalistic. It should only contain the code necessary to parse Oracle redo logs - nothing more; it should not be a fully functional replicator. So the targets are limited to middleware (like Kafka, Flink, some MQ). The number of dependencies is kept to a minimum.
>>
>> The second assumption is to make the project stateless wherever possible. The goal is to run in HA (Kubernetes) and store state in Redis (not yet implemented). But generally OpenLogReplicator should not keep information (if possible) about the position of data confirmed by the receiver. This allows the receiver to choose how to handle failures (data duplicated on restart, idempotent messages).
>>
>> The third topic is the initial data load. There is plenty of software available for that, and there is absolutely no need to duplicate it in this project. No ETL, selects, etc. My goal is just to track changes.
>>
>> The fourth assumption is to write the code in C++ so that it is fast and I have full control over memory. The code can fully reuse memory and also work on machines with little memory. This allows easy compilation on Linux, but maybe in the future also on Solaris, AIX, HP-UX, or even Windows (if there is demand for that). I think Java is good for some solutions, but not for a binary parser which works heavily with memory and in most cases uses a zero-copy approach.
>>
>> The amount of data in the output is actually defined by the source database (how much is logged - the full schema or just the changed columns). I don't care; the user defines what is logged by the database. If it is just the primary key and changed columns - I can send just the changed data. If someone wants the full schema in every payload - this is fine too. If the schema changes - no problem, I can provide just the DDL commands and process further payloads with the new schema.
>>
>> The format of the data is actually defined by the receiver. My first choice was JSON. Then the Debezium folks asked me to support Protobuf. OK, I spent a lot of time and extended the architecture to make the code modular and allow choosing the format of the payload. The writer module can directly produce JSON or Protobuf payloads, and that can be extended to any other format if there is demand for it. The JSON output also allows many formatting options. I generally don't test the Protobuf code - I would treat it as a prototype, because I don't know anybody who would like to use it. This code was planned for the Debezium request, but so far nobody cares.
>>
>> Integration with other systems and languages is an open case; I am agnostic here. The data produced for output is stored in a buffer and can be sent to any target. This is done by the Writer module (you can look at the code), and there is a writer for Kafka, ZeroMQ and even a plain network TCP/IP connection. I don't fully understand the question about adapting that better - if I have a specification, I can extend it. Just say what you need.
>>
>> In the case where we have a bidirectional connection (unlike with Kafka), the receiver can define the starting position (SCN) of the stream it wants to receive.
>>
>> You can look at the prototype code to see how this communication would look: StreamClient.cpp - but please treat it rather as a working prototype. It is a client which just connects to OpenLogReplicator over the network, defines the starting SCN and then receives the payload.
>>
>> In case:
>> - The connection is broken: the client reconnects, tells the server the last confirmed data and just asks for the following transactions.
>> - OpenLogReplicator crashes: after restart the client tells the server the last confirmed data and asks for the following transactions.
>> - The client crashes: the client needs to recover itself and ask for the transactions that come after the data it has already confirmed.
>>
>> I assume that if the client confirms that some SCN has been processed, OpenLogReplicator can remove it from the cache, and it is not possible that after a reconnect the client would demand data that it previously declared as confirmed.
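>> To make this flow concrete, a receiver-side loop could look roughly like the sketch below. This is only a hypothetical Java sketch: the host, port and the request/confirm message format are invented for illustration, and the real protocol is whatever StreamClient.cpp implements.
>>
>>     // Hypothetical receiver loop - names, port and wire format are invented.
>>     import java.io.BufferedReader;
>>     import java.io.InputStreamReader;
>>     import java.io.PrintWriter;
>>     import java.net.Socket;
>>     import java.nio.charset.StandardCharsets;
>>
>>     final class OlrClientSketch {
>>         public static void main(String[] args) throws Exception {
>>             long lastConfirmedScn = 0L; // recovered from the receiver's own state
>>             try (Socket socket = new Socket("olr-host", 9000); // hypothetical endpoint
>>                  PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
>>                  BufferedReader in = new BufferedReader(
>>                          new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
>>
>>                 // 1. Tell the server where to start: right after the last confirmed SCN.
>>                 out.println("{\"start-scn\": " + lastConfirmedScn + "}");
>>
>>                 // 2. Receive payloads, hand them downstream, and confirm the SCN so
>>                 //    OpenLogReplicator can drop the confirmed data from its cache.
>>                 String payload;
>>                 while ((payload = in.readLine()) != null) {
>>                     long scn = process(payload);
>>                     out.println("{\"confirm-scn\": " + scn + "}");
>>                     lastConfirmedScn = scn;
>>                 }
>>             }
>>             // After any crash or disconnect, the same loop runs again with the
>>             // last confirmed SCN, which gives at-least-once delivery.
>>         }
>>
>>         private static long process(String payload) {
>>             // parse the payload, pass the change event downstream and return its SCN
>>             return 0L;
>>         }
>>     }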
>> Well, this is what is currently done. Some of the code was driven by requests from the Debezium team with future integration in mind, like support for Protobuf or putting some extra data into the payload - but it was never used.
>> We have opened a ticket in their Jira for the integration:
>> https://issues.redhat.com/projects/DBZ/issues/DBZ-2543?filter=allopenissues
>> But there is no progress and no feedback on whether they want to do the integration or not. I have made some effort to allow easier integration, but I'm not going to write Kafka Connect code for OpenLogReplicator. I just don't have the resources for that. I think they are focused on their own approach with LogMiner, waiting for OpenLogReplicator to become more mature before any integration is done. If you want the Flink integration to depend on the Debezium integration, it may never happen.
>>
>> Recently I was focused mostly on making the code stable and releasing version 1.0, and I have reached that point. I am not aware of any problems with the code that is currently working. The code is meant to be modular and allow easy integration, but as you mentioned, there is no SDK. Actually, this is the topic I would like to talk about. Is there a reason for an SDK? Would someone find it useful? Maybe plain Kafka is enough. Maybe it would be best if someone took the code and rewrote it in Java? But definitely not me - I would find that nonsense; the Java code would suffer.
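>> If an SDK did make sense, I imagine its whole surface could be very small - something roughly like the sketch below. These are purely hypothetical names, nothing like this exists today; it is only meant to anchor the discussion:
>>
>>     // Hypothetical Java SDK surface - these types do not exist, they only
>>     // illustrate how thin an integration point could be.
>>     interface OpenLogReplicatorClient extends AutoCloseable {
>>         // Connect to a running OpenLogReplicator instance and start streaming
>>         // from the given SCN.
>>         void connect(String host, int port, long startScn) throws Exception;
>>
>>         // Block until the next change payload (JSON or Protobuf bytes) is available.
>>         byte[] poll() throws Exception;
>>
>>         // Confirm that everything up to this SCN has been durably handled,
>>         // so the server can drop it from its cache.
>>         void confirm(long scn) throws Exception;
>>     }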
>> What kind of interface would be best for Flink? OpenLogReplicator produces the payload in Protobuf or JSON. If you want to use, for example, XML, it would be a waste to write a converter; I would definitely prefer to add another writer module that just produces XML instead. If you need a certain format - this is no problem.
>>
>> But if you want a full initial data load (snapshot) - this can't be done, because this project is not for that. You have your own good code for it.
>>
>> In practice I think there would be just a few projects that could be receivers of data from OpenLogReplicator, so there is no reason to write a generic SDK for everybody.
>>
>> My goal was just to start a conversation - to discuss whether such an integration really makes sense or not. I really prefer a simple architecture, with as few data conversions as necessary - not one where I produce some format and you convert it anyway. This way replication from Oracle can be really fast.
>>
>> I'm just about to begin writing tutorials for OpenLogReplicator, and the documentation is out of date. I have a regular day job which I need to pay the rent, and a family, and I work on this project only in my free time, so progress is slow. I don't expect that to change in the future. But in spite of that, I know companies that already use the code in production, and it works fast and stably. From the clients' perspective it works 10 times faster than LogMiner - but this depends on the actual case; you would need to run a benchmark and test it yourself.
>>
>> Regards,
>> Adam
>>
>>> On 5 Jan 2023, at 09:41, Leonard Xu <xbjt...@gmail.com> wrote:
>>>
>>> Hi Adam & Márton,
>>>
>>> Thanks for bringing the discussion here.
>>>
>>> The Flink CDC project provides the Oracle CDC connector, which can be used to capture historical and transaction-log data from an Oracle database and ingest it into Flink. In the latest version, 2.3, the Oracle CDC connector already supports the parallel incremental snapshot algorithm, which allows parallel reading of historical data and lock-free switching from historical reading to transaction-log reading. In the transaction-log capturing phase, the connector uses Debezium as the library, which supports the LogMiner and XStream APIs to capture change data. IIUC, OpenLogReplicator could be used as a third way.
>>>
>>> For integrating OpenLogReplicator, there are several interesting points that we can discuss further:
>>> (1) Flink CDC connectors do not rely on Kafka or other message-queue storage; data is processed directly after capture. I think the network-stream approach of OpenLogReplicator would need to be adapted for this.
>>> (2) The Flink CDC project, like Flink itself, is mainly developed in Java. Does OpenLogReplicator provide a Java SDK for easy integration?
>>> (3) If OpenLogReplicator is planned to be integrated into the Debezium project first, the Flink CDC project could easily integrate OpenLogReplicator by bumping the Debezium version.
>>>
>>> Best,
>>> Leonard
>>
>>> On 5 Jan 2023, at 04:15, Jark Wu <imj...@gmail.com> wrote:
>>>
>>> Hi Adam,
>>>
>>> Thanks for sharing this interesting project. I think it is definitely valuable for users who want better speed.
>>>
>>> I am one of the maintainers of the flink-cdc-connectors project. The project offers an "oracle-cdc" connector which uses Debezium (which depends on LogMiner) as the CDC library. From the perspective of the "oracle-cdc" connector, I have some questions about OpenLogReplicator:
>>>
>>> 1) Can OpenLogReplicator provide a Java SDK to allow Flink to communicate with the Oracle server directly, without deploying any other service?
>>> 2) How much overhead does it put on Oracle compared to the LogMiner approach?
>>> 3) Did you discuss this with the Debezium community? I think Debezium might be interested in this project as well.
>>>
>>> Best,
>>> Jark
>>>
>>>> On 5 Jan 2023, at 07:32, Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>
>>>> Hi Márton,
>>>>
>>>> Thank you very much for your answer.
>>>>
>>>> The point about Kafka makes sense. It offers a huge bag of potential connectors that could be used. But … not everybody wants or needs Kafka. It brings additional architectural complication and delays, which might not be acceptable to everybody. That's why you have your own connectors anyway.
>>>>
>>>> The Flink connector which reads from Oracle uses the LogMiner technology, which is not acceptable for every user. It has big limitations regarding speed. You can overcome that only with a binary reader of the database redo logs (roughly 10 times faster, with latency even down to 50-100 ms).
>>>>
>>>> The reason I am asking is not just to create some additional connector for fun. My main concern is whether there is actual demand from users for getting changes from the source database faster, or with lower latency. You can find a lot of information on the net about the differences between a binary log-based reader and one using the LogMiner technology.
>>>>
>>>> I think that would be enough for a start. Please tell me what you think about it. Would anyone consider using such a connector?
>>>>
>>>> Regards,
>>>> Adam Leszczyński
>>>>
>>>>> On 4 Jan 2023, at 12:07, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>
>>>>> (cc Leonard)
>>>>>
>>>>> Hi Adam,
>>>>>
>>>>> From an architectural perspective, if you land the records in Kafka or another message broker, Flink will be able to process them; at this point I do not see much merit in trying to circumvent this step. There is a related project in the Flink space called CDC Connectors [1] - I highly encourage you to check that out for context, and I have cc'd Leonard, one of its primary maintainers.
>>>>>
>>>>> [1] https://github.com/ververica/flink-cdc-connectors/
>>>>>
>>>>> On Tue, Jan 3, 2023 at 8:40 PM Adam Leszczyński <aleszczyn...@bersler.com> wrote:
>>>>>>
>>>>>> Hi Flink Team,
>>>>>>
>>>>>> I'm the author of OpenLogReplicator - an open-source parser of Oracle redo logs which allows sending transactions to a message bus. Currently the implemented sinks are just a text file or a Kafka topic. Transactions can also be sent over a plain TCP connection or a message queue like ZeroMQ. The code is GPL and all Oracle versions from 11.2 are supported. No LogMiner is needed.
>>>>>>
>>>>>> Transactions can be sent in JSON or Protobuf format. The code has reached GA and is actually used in production.
>>>>>>
>>>>>> The architecture is modular and makes it very easy to add other sinks, like for example Apache Flink. Actually, I'm moving towards an approach where OpenLogReplicator could use Kubernetes and work in HA.
>>>>>>
>>>>>> Well… that is the general direction. Do you think there could be some application of this software with Apache Flink? For example, there could very easily be a client which connects to OpenLogReplicator over a TCP connection, gets the transactions and just sends them to Apache Flink. An example of such a client is also present in the GitHub repo:
>>>>>> https://github.com/bersler/OpenLogReplicator
>>>>>>
>>>>>> Is there any rationale for such an integration? Or is it just a waste of time because nobody would use it anyway?
>>>>>>
>>>>>> Kind regards,
>>>>>> Adam Leszczyński