[I] [Feature][Mongodb]Using MongoDB for data migration，How to ensure that the data written to the target MongoDB is consistent with the original data [seatunnel]

via GitHub Sat, 21 Dec 2024 05:56:51 -0800


bunenghulai opened a new issue, #8359:
URL: https://github.com/apache/seatunnel/issues/8359


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   How can we ensure that the data written to the target MongoDB is consistent
with the source data when using SeaTunnel for a MongoDB migration? For example,
suppose a MongoDB collection contains three documents with the following
fields: (1) id, name, age; (2) id, name; (3) id, name, gender. Synchronizing
them with SeaTunnel requires a field mapping that lists every field: id, name,
age, and gender. The first document has no gender field in the source, but
after it is written to the target database it has an additional gender field
with a null value. Similarly, the second document ends up with two null fields,
age and gender. Both results are clearly inconsistent with the source data. How
can fields with null values be filtered out automatically when writing to the
target database?
   
   Second, specifying the field mapping in advance requires the user to know
exactly which fields exist. Some collections were created by other developers,
so the person synchronizing the data may not know which fields they contain. A
collection may hold hundreds of millions of documents, and it is not feasible
to inspect every document, gather all of its fields, and only then configure
the mapping.
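
To illustrate the field-discovery problem described above, here is a minimal
Python sketch that collects the union of field paths from a set of documents
and counts how often each one occurs. It is not part of SeaTunnel; the function
name `collect_field_paths` and the sample data are hypothetical. With a real
collection, one possible way to feed it would be a random sample (for example
MongoDB's `$sample` aggregation stage), with the caveat that sampling can miss
rarely used fields.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def collect_field_paths(docs: Iterable[Mapping[str, Any]]) -> Counter:
    """Count how often each dotted field path occurs in the given documents."""
    counts: Counter = Counter()

    def walk(value: Any, prefix: str) -> None:
        # Only sub-documents (mappings) contribute further field paths.
        if isinstance(value, Mapping):
            for key, child in value.items():
                path = f"{prefix}.{key}" if prefix else key
                counts[path] += 1
                walk(child, path)

    for doc in docs:
        walk(doc, "")
    return counts


# Hypothetical sample mirroring the three documents from the description.
sample = [
    {"id": 1, "name": "a", "age": 30},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c", "gender": "f"},
]
field_counts = collect_field_paths(sample)
print(sorted(field_counts))  # ['age', 'gender', 'id', 'name']
```

The counts show which fields are optional (here `age` and `gender` each occur
once out of three documents), which is the information needed to write the
schema by hand.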
   
   ### Usage Scenario
   
   This is for migrating data between MongoDB clusters, where the migrated data
should stay consistent with the source data. In MongoDB, the documents in a
collection do not all have to contain the same fields. One document may contain
five fields, while the next may contain only three because fields whose values
would be null are simply not written. This flexibility is an advantage of
MongoDB storage, but it also means that the number of fields differs from
document to document.
   
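
The behavior requested here, not writing null-valued fields, can be sketched
in plain Python as follows. This only illustrates the idea and is not an
existing SeaTunnel option; `strip_nulls` is a hypothetical helper. Note one
limitation: once the fixed schema has filled in nulls, this approach cannot
tell a field that was absent in the source from one that was explicitly stored
as null, so both are dropped.

```python
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Return a copy of value with None-valued dict entries removed, recursively."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        # List positions are preserved, so None items inside arrays are kept.
        return [strip_nulls(item) for item in value]
    return value


# Hypothetical row as it looks after the fixed schema has padded missing fields.
row = {"id": 2, "name": "b", "age": None, "gender": None}
print(strip_nulls(row))  # {'id': 2, 'name': 'b'}
```

Sub-documents whose fields are all null are left as empty documents by this
sketch; whether they should be dropped as well depends on what the source data
looked like.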
   The configuration file is as follows:
   
   env {
     parallelism = 1
     job.mode = "BATCH"
   }
   
   source {
     MongoDB {
       uri = "mongodb://XXX.XX.0.XXX:20003/device"
       database = "device"
       collection = "oaidmd5_${num}"
       match.projection = "{_id:0}"
       partition.split-key = "oaidmd5"
       partition.split-size = 2048
       schema = {
         fields {
           oaidmd5 = String
           age = {
            qtt = {
              0 = Int
              1 = Int
              2 = Int
              3 = Int
            }
           }
           brand = String
           gender = String
           model = String
           oaid = String
           osv = String
           upts = Int
           clk1= {
            vip = {
             51 = Int
             _ttc_ = Int
            }
           }
           interest = {
            5 = Double
            7 = Double
            18 = Double
            14 = Double
           }
           interest_1 = {
            9 = Double
           }
           interest_3 = {
            9 = Double
           }
           interest_7 = {
            5 = Double
            7 = Double
            9 = Double
           }
           interest_14 = {
            9 = Double
           }
           pkg_list = "array<String>"
         }
       }
     }
   }
   sink {
     MongoDB{
       uri = "mongodb://xxx.xxx.xx.xxx:20003/device"
       database = "device"
       collection = "oaidmd5_${num}"
       buffer-flush.max-rows = 2000
       buffer-flush.interval = 1000
       schema = {
         fields {
           oaidmd5 = String
           age = {
            qtt = {
              0 = Int
              1 = Int
              2 = Int
              3 = Int
            }
           }
           brand = String
           gender = String
           model = String
           oaid = String
           osv = String
           upts = Int
           clk1= {
            vip = {
             51 = Int
             _ttc_ = Int
            }
           }
           interest = {
            5 = Double
            7 = Double
            18 = Double
            14 = Double
           }
           interest_1 = {
            9 = Double
           }
           interest_3 = {
            9 = Double
           }
           interest_7 = {
            5 = Double
            7 = Double
            9 = Double
           }
           interest_14 = {
            9 = Double
           }
           pkg_list = "array<String>"
         }
       }
     }
   }
   
   The result is shown in the figure:
   
![image](https://github.com/user-attachments/assets/628befbf-9bbd-44ae-8d82-cd141fbb8f83)
   
   
   
   
   
   
   ### Related issues
   
   no
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
