danny0405 commented on code in PR #14320:
URL: https://github.com/apache/hudi/pull/14320#discussion_r2553804147
##########
website/docs/ingestion_flink.md:
##########
@@ -1,179 +1,361 @@
 ---
 title: Using Flink
 keywords: [hudi, flink, streamer, ingestion]
+last_modified_at: 2025-11-22T12:53:57+08:00
 ---
-### CDC Ingestion
-CDC(change data capture) keep track of the data changes evolving in a source system so a downstream process or system can action that change.
+## CDC Ingestion
+
+CDC (change data capture) keeps track of data changes evolving in a source system so a downstream process or system can act on those changes.
 We recommend two ways for syncing CDC data into Hudi:
 
-1. Using the Ververica [flink-cdc-connectors](https://github.com/ververica/flink-cdc-connectors) directly connect to DB Server to sync the binlog data into Hudi.
-   The advantage is that it does not rely on message queues, but the disadvantage is that it puts pressure on the db server;
-2. Consume data from a message queue (for e.g, the Kafka) using the flink cdc format, the advantage is that it is highly scalable,
+1. Use the Ververica [flink-cdc-connectors](https://github.com/ververica/flink-cdc-connectors) to directly connect to the database server and sync binlog data into Hudi.
+   The advantage is that it does not rely on message queues, but the disadvantage is that it puts pressure on the database server.
+2. Consume data from a message queue (e.g., Kafka) using the Flink CDC format. The advantage is that it is highly scalable,
    but the disadvantage is that it relies on message queues.
 
 :::note
-- If the upstream data cannot guarantee the order, you need to specify option `write.precombine.field` explicitly;
+If the upstream data cannot guarantee ordering, you need to explicitly specify the `write.precombine.field` option.
 :::
 
-### Bulk Insert
+## Bulk Insert
 
-For the demand of snapshot data import. If the snapshot data comes from other data sources, use the `bulk_insert` mode to quickly
+For snapshot data import requirements, if the snapshot data comes from other data sources, use the `bulk_insert` mode to quickly
 import the snapshot data into Hudi.
-
 :::note
-`bulk_insert` eliminates the serialization and data merging. The data deduplication is skipped, so the user need to guarantee the uniqueness of the data.
+`bulk_insert` eliminates serialization and data merging. Data deduplication is skipped, so the user needs to guarantee data uniqueness.
 :::
 
 :::note
-`bulk_insert` is more efficient in the `batch execution mode`. By default, the `batch execution mode` sorts the input records
-by the partition path and writes these records to Hudi, which can avoid write performance degradation caused by
-frequent `file handle` switching.
+`bulk_insert` is more efficient in `batch execution mode`. By default, `batch execution mode` sorts the input records
+by partition path and writes these records to Hudi, which can avoid write‑performance degradation caused by
+frequent file‑handle switching.
 :::
 
-:::note
-The parallelism of `bulk_insert` is specified by `write.tasks`. The parallelism will affect the number of small files.
-In theory, the parallelism of `bulk_insert` is the number of `bucket`s (In particular, when each bucket writes to maximum file size, it
-will rollover to the new file handle. Finally, `the number of files` >= [`write.bucket_assign.tasks`](configurations#writebucket_assigntasks).
+:::note
+The parallelism of `bulk_insert` is specified by `write.tasks`. The parallelism affects the number of small files.
+In theory, the parallelism of `bulk_insert` equals the number of buckets. (In particular, when each bucket writes to the maximum file size, it
+rolls over to a new file handle.) Finally, the number of files ≥ [`write.bucket_assign.tasks`](configurations#writebucket_assigntasks).
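For reference while reviewing, the `bulk_insert` options discussed in this hunk (`write.operation`, `write.tasks`) can be combined in a minimal Flink SQL sketch. The table name, schema, path, and source table below are hypothetical placeholders, not part of the patch:

```sql
-- bulk_insert is most efficient in batch execution mode (see the note above)
SET 'execution.runtime-mode' = 'batch';

-- Hypothetical target table; schema and path are illustrative only
CREATE TABLE hudi_snapshot (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts TIMESTAMP(3),
  `partition` STRING
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_snapshot',
  'write.operation' = 'bulk_insert',  -- skips dedup/merging; uniqueness must be guaranteed upstream
  'write.tasks' = '4'                 -- write parallelism; affects the number of files produced
);

-- source_table is a placeholder for the snapshot source being imported
INSERT INTO hudi_snapshot SELECT * FROM source_table;
```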
Review Comment:
   "Finally, the number of files ≥" -> The final files number is greater than or equals to ...
