danny0405 commented on code in PR #14320:
URL: https://github.com/apache/hudi/pull/14320#discussion_r2553804147
##########
website/docs/ingestion_flink.md:
##########
@@ -1,179 +1,361 @@
 ---
 title: Using Flink
 keywords: [hudi, flink, streamer, ingestion]
+last_modified_at: 2025-11-22T12:53:57+08:00
 ---
-### CDC Ingestion
-CDC(change data capture) keep track of the data changes evolving in a source system so a downstream process or system can action that change.
+## CDC Ingestion
+
+CDC (change data capture) keeps track of data changes evolving in a source system so a downstream process or system can act on those changes.
 We recommend two ways for syncing CDC data into Hudi:
 
-1. Using the Ververica [flink-cdc-connectors](https://github.com/ververica/flink-cdc-connectors) directly connect to DB Server to sync the binlog data into Hudi.
-   The advantage is that it does not rely on message queues, but the disadvantage is that it puts pressure on the db server;
-2. Consume data from a message queue (for e.g, the Kafka) using the flink cdc format, the advantage is that it is highly scalable,
+1. Use the Ververica [flink-cdc-connectors](https://github.com/ververica/flink-cdc-connectors) to directly connect to the database server and sync binlog data into Hudi.
+   The advantage is that it does not rely on message queues, but the disadvantage is that it puts pressure on the database server.
+2. Consume data from a message queue (e.g., Kafka) using the Flink CDC format. The advantage is that it is highly scalable,
    but the disadvantage is that it relies on message queues.
 
 :::note
-- If the upstream data cannot guarantee the order, you need to specify option `write.precombine.field` explicitly;
+If the upstream data cannot guarantee ordering, you need to explicitly specify the `write.precombine.field` option.
 :::
 
-### Bulk Insert
+## Bulk Insert
 
-For the demand of snapshot data import. If the snapshot data comes from other data sources, use the `bulk_insert` mode to quickly
+For snapshot data import requirements, if the snapshot data comes from other data sources, use the `bulk_insert` mode to quickly
 import the snapshot data into Hudi.
-
 :::note
-`bulk_insert` eliminates the serialization and data merging. The data deduplication is skipped, so the user need to guarantee the uniqueness of the data.
+`bulk_insert` eliminates serialization and data merging. Data deduplication is skipped, so the user needs to guarantee data uniqueness.
 :::
 
 :::note
-`bulk_insert` is more efficient in the `batch execution mode`. By default, the `batch execution mode` sorts the input records
-by the partition path and writes these records to Hudi, which can avoid write performance degradation caused by
-frequent `file handle` switching.
+`bulk_insert` is more efficient in `batch execution mode`. By default, `batch execution mode` sorts the input records
+by partition path and writes these records to Hudi, which can avoid write‑performance degradation caused by
+frequent file‑handle switching.
 :::
 
-:::note
-The parallelism of `bulk_insert` is specified by `write.tasks`. The parallelism will affect the number of small files.
-In theory, the parallelism of `bulk_insert` is the number of `bucket`s (In particular, when each bucket writes to maximum file size, it
-will rollover to the new file handle. Finally, `the number of files` >= [`write.bucket_assign.tasks`](configurations#writebucket_assigntasks).
+:::note
+The parallelism of `bulk_insert` is specified by `write.tasks`. The parallelism affects the number of small files.
+In theory, the parallelism of `bulk_insert` equals the number of buckets. (In particular, when each bucket writes to the maximum file size, it
+rolls over to a new file handle.) Finally, the number of files ≥ [`write.bucket_assign.tasks`](configurations#writebucket_assigntasks).
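For reference while reviewing, the `bulk_insert` options discussed in this hunk (`write.operation`, `write.tasks`) can be combined in a minimal Flink SQL sketch. The table name, schema, path, and source table below are hypothetical placeholders, not part of the patch:

```sql
-- bulk_insert is most efficient in batch execution mode (see the note above)
SET 'execution.runtime-mode' = 'batch';

-- Hypothetical target table; schema and path are illustrative only
CREATE TABLE hudi_snapshot (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts TIMESTAMP(3),
  `partition` STRING
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_snapshot',
  'write.operation' = 'bulk_insert',  -- skips dedup/merging; uniqueness must be guaranteed upstream
  'write.tasks' = '4'                 -- write parallelism; affects the number of files produced
);

-- source_table is a placeholder for the snapshot source being imported
INSERT INTO hudi_snapshot SELECT * FROM source_table;
```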
Review Comment:
   "Finally, the number of files ≥" -> The final files number is greater than or equals to ...
