Shubham Patel created HUDI-9793:
-----------------------------------
Summary: Improve PrometheusReporter to support multi-table
scenarios with reference counting
Key: HUDI-9793
URL: https://issues.apache.org/jira/browse/HUDI-9793
Project: Apache Hudi
Issue Type: Improvement
Components: metrics
Reporter: Shubham Patel
h2. Summary
Currently, {{PrometheusReporter}} stops the shared HTTP server as soon as any
table stops, which breaks metrics for the other tables in multi-table scenarios.
This enhancement adds reference counting so that the server is stopped only
when no table is using it.
h2. Current Behavior
Multiple tables can share the same Prometheus server at startup (thanks to
HUDI-7083), but when any table stops, the entire server is stopped and removed
from the static maps. The remaining tables lose their metrics endpoint and
cannot emit new metrics. This works fine for single-table-per-job scenarios
but breaks multi-table use cases.
h2. Problem
Users running multiple Hudi tables in a single Spark job with pause/resume
functionality experience broken metrics when individual tables are paused or
stopped.
h2. Current Behavior (Problematic Flow)
# Table A starts → Creates Prometheus server on port 9091
# Table B starts → Reuses same server (works thanks to HUDI-7083)
# *Table A pauses* → Calls {{PrometheusReporter.stop()}} → *Stops entire server*
# Table B continues → No server available → *Metrics broken*
h2. Proposed Enhancement
Add reference counting to track how many tables are using each server port:
* Add {{PORT_TO_REFERENCE_COUNT}} and {{PORT_TO_EXPORTS}} maps
* Increment the count when a table starts using the server
* Decrement the count when a table stops; stop the server only when the count reaches zero
* Add utility methods for debugging ({{isServerRunning}}, {{getReferenceCount}})
h2. Future Behavior
# Table A starts → Server starts, reference count = 1
# Table B starts → Server reused, reference count = 2
# *Table A pauses* → Reference count = 1, *server stays alive*
# Table B continues → *Metrics continue working*
# Table B stops → Reference count = 0, server stops gracefully
h2. Implementation Details
The implementation involves modifying the {{PrometheusReporter}} constructor to
increment reference counts and register exports in a thread-safe manner. The
{{stop()}} method would be updated to decrement reference counts and perform
cleanup only when no references remain. We would also add utility methods for
debugging and monitoring server status. All changes would be backward
compatible, ensuring single-table usage continues to work exactly as before.
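
The reference-counting flow described in this section could look roughly like the sketch below. It is an illustration only, not the actual {{PrometheusReporter}} code: the class name {{RefCountedServers}}, the {{Server}} stand-in interface, and the {{acquire}}/{{release}} method names are assumptions made for this example, and the {{PORT_TO_EXPORTS}} bookkeeping is left out. Only {{PORT_TO_REFERENCE_COUNT}}, {{isServerRunning}}, and {{getReferenceCount}} are names taken from this issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch of per-port reference counting; not Hudi's actual code. */
final class RefCountedServers {
  /** Stand-in for the real HTTP server; only the stop hook matters here. */
  interface Server {
    void stop();
  }

  private static final Map<Integer, Server> PORT_TO_SERVER = new HashMap<>();
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();

  private RefCountedServers() {
  }

  /** Reporter construction: start or reuse the server for the port, then count the new user. */
  static synchronized Server acquire(int port, IntFunction<Server> starter) {
    Server server = PORT_TO_SERVER.get(port);
    if (server == null) {
      // First table on this port: start the server. If this throws, no count is recorded.
      server = starter.apply(port);
      PORT_TO_SERVER.put(port, server);
    }
    PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
    return server;
  }

  /** Reporter stop: drop one reference and stop the server only when none remain. */
  static synchronized void release(int port) {
    Integer count = PORT_TO_REFERENCE_COUNT.get(port);
    if (count == null) {
      return; // unknown port or already fully released
    }
    if (count > 1) {
      PORT_TO_REFERENCE_COUNT.put(port, count - 1);
      return;
    }
    // Last reference: remove the bookkeeping and stop the server.
    PORT_TO_REFERENCE_COUNT.remove(port);
    Server server = PORT_TO_SERVER.remove(port);
    if (server != null) {
      server.stop();
    }
  }

  static synchronized int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }

  static synchronized boolean isServerRunning(int port) {
    return PORT_TO_SERVER.containsKey(port);
  }
}
```

Guarding both maps with a single class-level lock keeps the "check count, then stop" step atomic, so a table starting on the port cannot race with the last table stopping. With a single table the count goes from 1 to 0 on the first {{release}}, which matches the existing stop behavior.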
h2. Backward Compatibility
* Single table usage unchanged - works exactly as before
* No breaking changes
* Reference count goes 1→0 for single table, server stops as before
h2. Use Cases
* Multi-table Spark jobs with pause/resume functionality
* Shared Prometheus server across multiple Hudi tables
* Better resource management for Prometheus servers
h2. Testing
* *Multi-table reference counting*: Basic 2-table scenario
* *Concurrent access*: 10 threads creating reporters simultaneously
* *Port isolation*: Different ports work independently
* *Partial failure*: One table fails while others continue
* *Thread safety*: All operations properly synchronized
* *Backward compatibility*: Single table behavior unchanged
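
As a rough illustration of the concurrent-access case, the harness below releases 10 threads at the same instant against a reference-counted resource. Everything in it is hypothetical: {{SharedServer}} is a minimal stand-in for the reporter's per-port state, and {{ConcurrentAcquireCheck}} and {{runConcurrently}} are names invented for this sketch. The real tests would presumably construct {{PrometheusReporter}} instances instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical harness for the "10 threads" scenario; not part of Hudi. */
final class ConcurrentAcquireCheck {
  /** Minimal reference-counted resource guarded by its own monitor. */
  static final class SharedServer {
    private int references;
    private int starts;
    private int stops;

    synchronized void acquire() {
      if (references++ == 0) {
        starts++; // first user starts the server
      }
    }

    synchronized void release() {
      if (references == 0) {
        return; // nothing to release
      }
      if (--references == 0) {
        stops++; // last user stops the server
      }
    }

    synchronized int references() {
      return references;
    }

    synchronized int starts() {
      return starts;
    }

    synchronized int stops() {
      return stops;
    }
  }

  private ConcurrentAcquireCheck() {
  }

  /** Runs the action once on each of `threads` threads, all released together. */
  static void runConcurrently(int threads, Runnable action) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch ready = new CountDownLatch(threads);
    CountDownLatch go = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        ready.countDown();
        try {
          go.await(); // block until every worker is ready
          action.run();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      });
    }
    try {
      ready.await();
      go.countDown();
      if (!done.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

A test built on this could call {{runConcurrently(10, server::acquire)}} and check that the count is 10 with exactly one start, then call {{runConcurrently(10, server::release)}} and check that the count is 0 with exactly one stop.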
--
This message was sent by Atlassian Jira
(v8.20.10#820010)