bunenghulai opened a new issue, #8359: URL: https://github.com/apache/seatunnel/issues/8359
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

When migrating data between MongoDB instances, how can we ensure that the data written to the target MongoDB is identical to the source data?

For example, suppose a MongoDB collection holds three records with the following fields:

1. id, name, age
2. id, name
3. id, name, gender

Synchronizing this data with SeaTunnel requires a field mapping that lists every field: id, name, age, and gender. The first record has no gender field, yet after it is written to the target database it gains a gender field with a null value. Likewise, the second record gains two null fields, age and gender. Neither record matches the source data any more. How can fields with null values be filtered out automatically when writing to the target database?

Second, specifying the field mapping in advance requires the user to know exactly which fields exist. Some collections were created by other developers, so the fields are not known at synchronization time. A collection may hold hundreds of millions of records, and it is not feasible to inspect every record, collect all the fields, and only then configure the mapping.
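To make the two requests concrete, here is a minimal sketch in plain Python of the behavior being asked for. It is not SeaTunnel code and uses no SeaTunnel or MongoDB API; `strip_nulls` and `collect_field_paths` are hypothetical helper names, and the sample records are made up to mirror the three-record example above. Note one assumption: a field that is genuinely stored as null in the source cannot be told apart from a null added by schema padding, so this sketch drops both.

```python
from typing import Any, Iterable


def strip_nulls(doc: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of doc without None-valued keys, recursing into sub-documents.

    A sub-document that becomes empty only because all of its fields were
    None is dropped as well; a sub-document that was already empty is kept.
    """
    cleaned: dict[str, Any] = {}
    for key, value in doc.items():
        if value is None:
            continue
        if isinstance(value, dict):
            stripped = strip_nulls(value)
            if value and not stripped:
                # Every nested field was None: treat the whole field as absent.
                continue
            value = stripped
        cleaned[key] = value
    return cleaned


def collect_field_paths(docs: Iterable[dict[str, Any]]) -> set[str]:
    """Return the union of dotted field paths seen across the given documents."""
    paths: set[str] = set()

    def walk(node: dict[str, Any], prefix: str) -> None:
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict) and value:
                walk(value, path)
            else:
                paths.add(path)

    for doc in docs:
        walk(doc, "")
    return paths


# Records as they look after being padded to the full schema (hypothetical data).
padded = [
    {"id": 1, "name": "a", "age": 30, "gender": None},
    {"id": 2, "name": "b", "age": None, "gender": None},
    {"id": 3, "name": "c", "age": None, "gender": "f"},
]

# Dropping the padded nulls restores the original per-record shape.
restored = [strip_nulls(d) for d in padded]

# The union of fields could be gathered from a sample instead of by hand.
fields = collect_field_paths(restored)
```

In a real migration the second helper would be run over a sample of the collection rather than all records, which trades completeness for speed; rare fields could still be missed.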
### Usage Scenario

Used for MongoDB cluster data migration, where the migrated data should stay consistent with the original data. In MongoDB, the records in a collection do not necessarily contain the same fields. One record may contain five fields, while in the next record some of those fields would be null, so we do not write them to the database and the record ends up with only three fields. This is one of the advantages of MongoDB storage, but it means the number of fields differs from record to record.

The configuration file is as follows:

    env {
      parallelism = 1
      job.mode = "BATCH"
    }

    source {
      MongoDB {
        uri = "mongodb://XXX.XX.0.XXX:20003/device"
        database = "device"
        collection = "oaidmd5_${num}"
        match.projection = "{_id:0}"
        partition.split-key = "oaidmd5"
        partition.split-size = 2048
        schema = {
          fields {
            oaidmd5 = String
            age = {
              qtt = {
                0 = Int
                1 = Int
                2 = Int
                3 = Int
              }
            }
            brand = String
            gender = String
            model = String
            oaid = String
            osv = String
            upts = Int
            clk1 = {
              vip = {
                51 = Int
                _ttc_ = Int
              }
            }
            interest = {
              5 = Double
              7 = Double
              18 = Double
              14 = Double
            }
            interest_1 = {
              9 = Double
            }
            interest_3 = {
              9 = Double
            }
            interest_7 = {
              5 = Double
              7 = Double
              9 = Double
            }
            interest_14 = {
              9 = Double
            }
            pkg_list = "array<String>"
          }
        }
      }
    }

    sink {
      MongoDB {
        uri = "mongodb://xxx.xxx.xx.xxx:20003/device"
        database = "device"
        collection = "oaidmd5_${num}"
        buffer-flush.max-rows = 2000
        buffer-flush.interval = 1000
        schema = {
          fields {
            oaidmd5 = String
            age = {
              qtt = {
                0 = Int
                1 = Int
                2 = Int
                3 = Int
              }
            }
            brand = String
            gender = String
            model = String
            oaid = String
            osv = String
            upts = Int
            clk1 = {
              vip = {
                51 = Int
                _ttc_ = Int
              }
            }
            interest = {
              5 = Double
              7 = Double
              18 = Double
              14 = Double
            }
            interest_1 = {
              9 = Double
            }
            interest_3 = {
              9 = Double
            }
            interest_7 = {
              5 = Double
              7 = Double
              9 = Double
            }
            interest_14 = {
              9 = Double
            }
            pkg_list = "array<String>"
          }
        }
      }
    }

The result is shown in the figure:

### Related issues

no

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
