This is an automated email from the ASF dual-hosted git repository.
ipolyzos pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new b09a4ee06 1794 : Add graceful shutdown documentation (#1796)
b09a4ee06 is described below
commit b09a4ee0695b53d904c0355044422b7553a19a21
Author: Hemanth Savasere <[email protected]>
AuthorDate: Sun Oct 12 22:27:11 2025 +0530
1794 : Add graceful shutdown documentation (#1796)
* docs(maintenance): add graceful shutdown procedures documentation
Add comprehensive documentation for graceful shutdown procedures in Fluss,
covering server shutdown processes, component-specific shutdown sequences,
best practices, and troubleshooting guidelines. The document provides
implementation details for both Coordinator and Tablet servers, along
with configuration references and monitoring recommendations.
* Changes done to remove unecessary docs
* add some improvements
---------
Co-authored-by: Hemanth <[email protected]>
Co-authored-by: ipolyzos <[email protected]>
---
.../maintenance/operations/graceful-shutdown.md | 121 +++++++++++++++++++++
1 file changed, 121 insertions(+)
diff --git a/website/docs/maintenance/operations/graceful-shutdown.md
b/website/docs/maintenance/operations/graceful-shutdown.md
new file mode 100644
index 000000000..7a183c3a5
--- /dev/null
+++ b/website/docs/maintenance/operations/graceful-shutdown.md
@@ -0,0 +1,121 @@
+# Graceful Shutdown
+
+Apache Fluss provides a **comprehensive graceful shutdown mechanism** to
ensure data integrity and proper resource cleanup when stopping servers or
services.
+
+This guide describes the shutdown procedures, configuration options, and best
practices for each Fluss component.
+
+## Overview
+
+Graceful shutdown in Fluss ensures that:
+- All ongoing operations complete safely
+- Resources are properly released
+- Data consistency is maintained
+- Network connections are cleanly closed
+- Background tasks are terminated properly
+
+These guarantees prevent data corruption and ensure smooth restarts of the
system.
+
+## Server Shutdown
+
+### Coordinator Server Shutdown
+
+The **Coordinator Server** uses a multi-stage shutdown process to safely
terminate all services in the correct order.
+#### Shutdown Process
+1. **Shutdown Hook Registration**: The server registers a JVM shutdown hook
that triggers graceful shutdown on process termination
+2. **Service Termination**: All services are stopped in a specific order to
maintain consistency:
+
+ **Coordinator Server Shutdown Order:**
+ 1. Server Metric Group → Metric Registry (async)
+ 2. Auto Partition Manager → IO Executor (5s timeout)
+ 3. Coordinator Event Processor → Coordinator Channel Manager
+ 4. RPC Server (async) → Coordinator Service
+ 5. Coordinator Context → Lake Table Tiering Manager
+ 6. ZooKeeper Client → Authorizer
+ 7. Dynamic Config Manager → Lake Catalog Dynamic Loader
+ 8. RPC Client → Client Metric Group
+
+3. **Resource Cleanup**: Executors, connections, and other resources are
properly closed
+
+```bash
+# Graceful shutdown via SIGTERM
+kill -TERM <coordinator-pid>
+
+# Or using the shutdown script (if available)
+./bin/stop-coordinator.sh
+```
+
+### Tablet Server Shutdown
+
+The **Tablet Server** supports a **controlled shutdown process** designed to
minimize data unavailability and ensure leadership handover before termination.
+
+**Shutdown Order:**
+1. Tablet Server Metric Group → Metric Registry (async)
+2. RPC Server (async) → Tablet Service
+3. ZooKeeper Client → RPC Client → Client Metric Group
+4. Scheduler → KV Manager → Remote Log Manager
+5. Log Manager → Replica Manager
+6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
+
+#### Controlled Shutdown Process
+
+1. **Leadership Transfer**: The server attempts to transfer leadership of all
buckets it leads to other replicas
+2. **Retry Logic**: If leadership transfer fails, the server retries with
configurable intervals
+3. **Timeout Handling**: After maximum retries, the server proceeds with
unclean shutdown if necessary
+
+```bash
+# Initiate controlled shutdown
+kill -TERM <tablet-server-pid>
+```
+
+#### Configuration Options
+
+- **Controlled Shutdown Retries**: Number of attempts to transfer leadership
(`default:` 3 retries)
+- **Retry Interval**: Time between retry attempts (`default`: 1000L)
+
+## Monitoring Shutdown
+
+### Logging
+
+Fluss provides detailed logging during shutdown processes:
+
+- **INFO**: Normal shutdown progress
+- **WARN**: Retry attempts or timeout warnings
+- **ERROR**: Shutdown failures or exceptions
+
+### Metrics
+
+Monitor shutdown-related metrics:
+
+- Shutdown duration
+- Failed shutdown attempts
+- Resource cleanup status
+
+## Troubleshooting
+
+### Common Issues
+| Issue | Possible Causes
| Recommended Actions
|
+| -------------------- |
--------------------------------------------------------------- |
-------------------------------------------------------------------------------
|
+| **Hanging shutdown** | Blocking operations, thread pool misconfiguration, or
deadlocks | Check for blocking calls without timeouts, inspect thread dumps
|
+| **Resource leaks** | Unclosed resources or connections
| Verify all `AutoCloseable` resources and file handles are closed
|
+| **Data loss** | Unclean shutdown or failed leadership transfer
| Always use controlled shutdown for Tablet Servers and verify
replication factor |
+
+### Debug Steps
+
+1. Enable debug logging for shutdown components
+2. Monitor JVM thread dumps during shutdown
+3. Check system resource usage
+4. Verify network connection states
+
+## Configuration Reference
+
+| Configuration | Description | Default |
+|---------------|-------------|---------|
+| `controlled.shutdown.max.retries` | Maximum retries for controlled shutdown
| 3 |
+| `controlled.shutdown.retry.interval.ms` | Interval between retry attempts |
5000 |
+| `shutdown.timeout.ms` | General shutdown timeout | 30000 |
+
+## See Also
+
+- [Configuration](../configuration.md)
+- [Monitoring and Observability](../observability/monitor-metrics.md)
+- [Upgrading Fluss](upgrading.md)
\ No newline at end of file