shfshihuafeng commented on issue #10344:
URL: https://github.com/apache/seatunnel/issues/10344#issuecomment-3789500955
@zhangshenghang
> Consider streaming mode with CDC if incremental sync is required
I tested MySQL CDC with savepoint, but I found that there is some duplicate data.
Original row count of the source table: 6001215.
(1) First, I import data with the config shown below.
(2) After 1441836 rows have been processed, I manually trigger a savepoint: `./bin/seatunnel.sh -s {jobId}`
(3) The task stops, and querying the data on HDFS (the sink) shows 1441836 rows.
(4) I restore the task from the savepoint: `./bin/seatunnel.sh -c config -r {jobId}`
(5) In the end, there are 7443051 rows (6001215 + 1441836) on HDFS, i.e. the 1441836 rows written before the savepoint are duplicated (see the count-check sketch after the config below).
```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
  #checkpoint.interval=30000000
  #checkpoint.timeout=1800000
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://localhost:3306"
    username = "root"
    password = "xxx!"
    table-names = ["tpch.LINEITEM_1"]
    startup.mode = "initial"
    #startup.specific-offset.file="binlog.000019"
    #startup.specific-offset.pos="631594286"
  }
}

transform {
}

sink {
  HdfsFile {
    # source_table_name = "et7"
    fs.defaultFS = "hdfs://xxx:9000"
    batch_size=3000000
    is_enable_transaction = false
    #have_partition = true
    #partition_by = ["city"]
    #compress_codec = "zlib"
    #partition_dir_expression = "${k0}=${v0}"
    path = "/mysql"
    file_format_type = "orc"
  }
}
```
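
For reference, a minimal sketch of how the sink row count can be verified (a hypothetical check script, not part of the job; it assumes pyarrow with libhdfs is available on the client, that the namenode matches `fs.defaultFS` above, and that the sink writes `.orc` files under `/mysql`):

```python
# Hypothetical verification sketch: sum the row counts of all ORC files
# under the sink path and compare against the source table row count.
import pyarrow.fs as pafs
import pyarrow.orc as orc

# Namenode host/port taken from fs.defaultFS in the sink config above.
fs = pafs.HadoopFileSystem(host="xxx", port=9000)

total = 0
for info in fs.get_file_info(pafs.FileSelector("/mysql", recursive=True)):
    if info.type == pafs.FileType.File and info.path.endswith(".orc"):
        with fs.open_input_file(info.path) as f:
            total += orc.ORCFile(f).nrows  # reads file metadata, not the rows

print(total)  # expected 6001215 after restore; observed 7443051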