Shubham Patel created HUDI-9793:
-----------------------------------

             Summary: Improve PrometheusReporter to support multi-table 
scenarios with reference counting
                 Key: HUDI-9793
                 URL: https://issues.apache.org/jira/browse/HUDI-9793
             Project: Apache Hudi
          Issue Type: Improvement
          Components: metrics
            Reporter: Shubham Patel


h2. Summary

Currently, {{PrometheusReporter}} stops the shared HTTP server as soon as any 
one table stops, which breaks metrics for the other tables in multi-table 
scenarios. This enhancement adds reference counting so that the server is 
stopped only when no table is using it any longer.
h2. Current Behavior

Multiple tables can share the same Prometheus server at startup (thanks to 
HUDI-7083), but when any one table stops, the entire server is stopped and 
removed from the static maps. The remaining tables then lose access to their 
metrics and cannot emit new ones. This works fine for single-table-per-job 
scenarios, but it breaks multi-table use cases.
h2. Problem

Users running multiple Hudi tables in a single Spark job with pause/resume 
functionality experience broken metrics when individual tables are paused or 
stopped.
h2. Current Behavior (Problematic Flow)
 # Table A starts → Creates Prometheus server on port 9091
 # Table B starts → Reuses same server (works thanks to HUDI-7083)
 # *Table A pauses* → Calls {{PrometheusReporter.stop()}} → *Stops entire server*
 # Table B continues → No server available → *Metrics broken*

h2. Proposed Enhancement

Add reference counting to track how many tables are using each server port:
 * Add {{PORT_TO_REFERENCE_COUNT}} and {{PORT_TO_EXPORTS}} maps
 * Increment the count when a table starts using the server
 * Decrement the count when a table stops; stop the server only when the count reaches zero
 * Add utility methods for debugging ({{isServerRunning}}, {{getReferenceCount}})
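
The counting scheme in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual Hudi change: the class name {{PortReferenceCounter}} and the methods {{acquire}} and {{release}} are hypothetical, the map's value type is an assumption, and {{PORT_TO_EXPORTS}} is left out because the issue does not say what it holds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of per-port reference counting; not the Hudi implementation. */
final class PortReferenceCounter {
  // Map name taken from the proposal; the Integer -> Integer typing is an assumption.
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new ConcurrentHashMap<>();

  private PortReferenceCounter() {
  }

  /** Registers one more table on the port and returns the new count. */
  static int acquire(int port) {
    return PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
  }

  /** Releases one table; returns true only when the caller should stop the server. */
  static boolean release(int port) {
    boolean[] lastReference = {false};
    // computeIfPresent runs atomically per key on ConcurrentHashMap.
    PORT_TO_REFERENCE_COUNT.computeIfPresent(port, (p, count) -> {
      if (count <= 1) {
        lastReference[0] = true;
        return null; // returning null removes the entry
      }
      return count - 1;
    });
    return lastReference[0];
  }

  /** Debugging helper named in the proposal; returns 0 for an unused port. */
  static int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }
}
```

In this sketch, releasing a port that was never acquired returns false, so a stray call cannot trigger a server shutdown.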

h2. Future Behavior
 # Table A starts → Server starts, reference count = 1
 # Table B starts → Server reused, reference count = 2
 # *Table A pauses* → Reference count = 1, *server stays alive*
 # Table B continues → *Metrics continue working*
 # Table B stops → Reference count = 0, server stops gracefully

h2. Implementation Details

The {{PrometheusReporter}} constructor would be modified to increment the 
reference count and register exports in a thread-safe manner. The {{stop()}} 
method would be updated to decrement the reference count and perform cleanup 
only when no references remain. We would also add utility methods for 
debugging and monitoring server status. All changes would be backward 
compatible, so single-table usage continues to work exactly as before.
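
One possible shape for this lifecycle is sketched below. The class {{RefCountedReporter}} is a hypothetical stand-in for {{PrometheusReporter}}, and {{FakeServer}} replaces the real HTTP server so the example runs without binding a port. The single shared lock and the guard that makes a repeated {{stop()}} a no-op are assumptions of this sketch, not details stated in the issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for PrometheusReporter; not the actual Hudi class. */
final class RefCountedReporter {
  /** Stand-in for the embedded HTTP server, so no real port is bound. */
  static final class FakeServer {
    volatile boolean running = true;

    void close() {
      running = false;
    }
  }

  private static final Object LOCK = new Object();
  private static final Map<Integer, FakeServer> PORT_TO_SERVER = new HashMap<>();
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();

  private final int port;
  private boolean stopped; // guarded by LOCK

  RefCountedReporter(int port) {
    this.port = port;
    synchronized (LOCK) {
      // Start the server only for the first user of the port, then count this user.
      PORT_TO_SERVER.computeIfAbsent(port, p -> new FakeServer());
      PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
    }
  }

  void stop() {
    synchronized (LOCK) {
      if (stopped) {
        return; // assumption: a second stop() must not decrement again
      }
      stopped = true;
      int remaining = PORT_TO_REFERENCE_COUNT.merge(port, -1, Integer::sum);
      if (remaining <= 0) {
        // Last reference released: clean up the maps and shut the server down.
        PORT_TO_REFERENCE_COUNT.remove(port);
        PORT_TO_SERVER.remove(port).close();
      }
    }
  }

  static boolean isServerRunning(int port) {
    synchronized (LOCK) {
      FakeServer server = PORT_TO_SERVER.get(port);
      return server != null && server.running;
    }
  }

  static int getReferenceCount(int port) {
    synchronized (LOCK) {
      return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
    }
  }
}
```

With a single reporter the count goes from 1 to 0 on {{stop()}} and the server closes immediately, which matches the single-table behavior described in this issue.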
h2. Backward Compatibility
 * Single-table usage is unchanged and works exactly as before
 * No breaking changes
 * For a single table the reference count goes 1→0, so the server stops as before

h2. Use Cases
 * Multi-table Spark jobs with pause/resume functionality
 * Shared Prometheus server across multiple Hudi tables
 * Better resource management for Prometheus servers

h2. Testing
 * {*}Multi-table reference counting{*}: Basic 2-table scenario
 * {*}Concurrent access{*}: 10 threads creating reporters simultaneously
 * {*}Port isolation{*}: Different ports work independently
 * {*}Partial failure{*}: One table fails while others continue
 * {*}Thread safety{*}: All operations properly synchronized
 * {*}Backward compatibility{*}: Single table behavior unchanged
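
The concurrent-access case could be exercised along the following lines. This is a sketch under stated assumptions rather than the actual test: {{ConcurrentAcquireCheck}}, {{COUNTS}} and {{SERVER_STARTS}} are hypothetical names, and a counter stands in for real server startup. All workers are released from one latch so they hit the shared map at nearly the same time.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical concurrency check; names are illustrative, not Hudi test code. */
final class ConcurrentAcquireCheck {
  static final ConcurrentHashMap<Integer, Integer> COUNTS = new ConcurrentHashMap<>();
  // Counts how often a "server" would have been started.
  static final AtomicInteger SERVER_STARTS = new AtomicInteger();

  private ConcurrentAcquireCheck() {
  }

  static void acquire(int port) {
    // compute is atomic per key, so only the first caller sees a null count.
    COUNTS.compute(port, (p, count) -> {
      if (count == null) {
        SERVER_STARTS.incrementAndGet();
        return 1;
      }
      return count + 1;
    });
  }

  /** Runs the given number of concurrent acquires on one port and returns the final count. */
  static int run(int port, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    try {
      for (int i = 0; i < threads; i++) {
        pool.submit(() -> {
          try {
            start.await(); // hold every worker until all are queued
            acquire(port);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            done.countDown();
          }
        });
      }
      start.countDown();
      if (!done.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } finally {
      pool.shutdownNow();
    }
    return COUNTS.getOrDefault(port, 0);
  }
}
```

A test built on this would assert that {{run(port, 10)}} returns 10 and that {{SERVER_STARTS}} equals 1, meaning ten references were recorded but only one server was started.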



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
