bhasudha commented on a change in pull request #4235:
URL: https://github.com/apache/hudi/pull/4235#discussion_r764452796



##########
File path: website/docs/hoodie_deltastreamer.md
##########
@@ -210,7 +210,137 @@ A deltastreamer job can then be triggered as follows:
 
 Read more in depth about concurrency control in the [concurrency control concepts](/docs/concurrency_control) section
 
+## Checkpointing
+HoodieDeltaStreamer uses checkpoints to keep track of what data has been read already so it can resume without needing to reprocess all data.
+When using a Kafka source, the checkpoint is the [Kafka Offset](https://cwiki.apache.org/confluence/display/KAFKA/Offset+Management) 
+When using a DFS source, the checkpoint is the 'last modified' timestamp of the latest file read.
+Checkpoints are saved in the .hoodie commit file as `deltastreamer.checkpoint.key`.
+
+If you need to change the checkpoints for reprocessing or replaying data you can use the following options:
+
+- `--checkpoint` will overwrite the current commit file checkpoint.
+- `--source-limit` will set a maximum amount of data to read from the source. For DFS sources, this is max # of bytes read.
+For Kafka, this is the max # of events to read.
+- `deltastreamer.checkpoint.reset_key` will temporarily run delta streamer from a specific checkpoint, but the current commit file checkpoint 

Review comment:
       We can drop this paragraph and capture its essence in the `--checkpoint` paragraph above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


