shfshihuafeng commented on issue #10344:
URL: https://github.com/apache/seatunnel/issues/10344#issuecomment-3789500955

   > @zhangshenghang Consider streaming mode with CDC if incremental sync is required
   
   I tested MySQL CDC with savepoint/restore, but I found duplicate data after restoring.

   The source table originally contains 6001215 rows.
   
   1. First, I imported data with the config below.

   2. When 1441836 rows had been processed, I manually triggered a savepoint:

      `./bin/seatunnel.sh -s {jobId}`

   3. The task stopped, and querying the sink data on HDFS showed 1441836 rows.

   4. I restarted the task from the savepoint:

      `./bin/seatunnel.sh -c config -r {jobId}`

   5. In the end, there were 7443051 rows (6001215 + 1441836) on HDFS.
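   The counts line up exactly with the initial snapshot being re-run on top of the rows already written before the savepoint. A quick sanity check of the arithmetic (pure illustration, numbers taken from the steps above):

   ```python
   # Counts observed in the steps above.
   rows_before_savepoint = 1441836   # rows on HDFS when the job was stopped
   full_snapshot = 6001215           # total rows in tpch.LINEITEM_1
   observed_after_restore = 7443051  # rows on HDFS after restoring

   # The observed total equals "full snapshot + rows written before the
   # savepoint", which suggests the restored job re-ran the whole initial
   # snapshot instead of resuming from the saved binlog position.
   assert rows_before_savepoint + full_snapshot == observed_after_restore

   duplicates = observed_after_restore - full_snapshot
   print(duplicates)  # 1441836 rows duplicated from the pre-savepoint run
   ```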
   
   
   ```hocon
   env {
     parallelism = 1
     job.mode = "STREAMING"
     #checkpoint.interval=30000000
     #checkpoint.timeout=1800000
   }

   source {
     MySQL-CDC {
       base-url = "jdbc:mysql://localhost:3306"
       username = "root"
       password = "xxx!"
       table-names = ["tpch.LINEITEM_1"]
       startup.mode = "initial"
       #startup.specific-offset.file="binlog.000019"
       #startup.specific-offset.pos="631594286"
     }
   }

   transform {
   }

   sink {
     HdfsFile {
       #source_table_name = "et7"
       fs.defaultFS = "hdfs://xxx:9000"
       batch_size=3000000
       is_enable_transaction = false
       #have_partition = true
       #partition_by = ["city"]
       #compress_codec = "zlib"
       #partition_dir_expression = "${k0}=${v0}"
       path = "/mysql"
       file_format_type = "orc"
     }
   }
   ```
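   As a diagnostic, the commented-out lines in the source block suggest resuming from an explicit binlog offset instead of the initial snapshot. A sketch of that variant (the file/pos values are the ones already commented out above, and the exact `startup.mode` value that enables them may differ by SeaTunnel version, so treat this as an assumption to verify against the connector docs):

   ```hocon
   source {
     MySQL-CDC {
       base-url = "jdbc:mysql://localhost:3306"
       username = "root"
       password = "xxx!"
       table-names = ["tpch.LINEITEM_1"]
       # Resume from an explicit binlog position rather than re-running the
       # initial snapshot. Mode name is an assumption; the offset values are
       # the ones commented out in the original config.
       startup.mode = "specific"
       startup.specific-offset.file = "binlog.000019"
       startup.specific-offset.pos = "631594286"
     }
   }
   ```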

