[ 
https://issues.apache.org/jira/browse/HUDI-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Patel updated HUDI-9793:
--------------------------------
    Status: In Progress  (was: Open)

> Improve PrometheusReporter to support multi-table scenarios with reference 
> counting
> -----------------------------------------------------------------------------------
>
>                 Key: HUDI-9793
>                 URL: https://issues.apache.org/jira/browse/HUDI-9793
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Shubham Patel
>            Priority: Major
>              Labels: metrics, multi-table, prometheus
>
> h2. Summary
> Currently, {{PrometheusReporter}} stops the HTTP server when any table stops, 
> breaking metrics for other tables in multi-table scenarios. This enhancement 
> adds reference counting to only stop the server when no tables are using it.
> h2. Current Behavior
> Multiple tables can share the same Prometheus server during startup (thanks 
> to HUDI-7083), but when any table stops, the entire server is stopped and 
> removed from static maps. This causes other tables to lose access to metrics 
> and prevents them from emitting new metrics. While this works fine for 
> single-table-per-job scenarios, it breaks multi-table use cases.
> h2. Problem
> Users running multiple Hudi tables in a single Spark job with pause/resume 
> functionality experience broken metrics when individual tables are paused or 
> stopped.
> h2. Current Behavior (Problematic Flow)
>  # Table A starts → Creates Prometheus server on port 9091
>  # Table B starts → Reuses same server (works thanks to HUDI-7083)
>  # *Table A pauses* → Calls {{PrometheusReporter.stop()}} → *Stops entire 
> server*
>  # Table B continues → No server available → *Metrics broken*
> h2. Proposed Enhancement
> Add reference counting to track how many tables are using each server port:
>  * Add {{PORT_TO_REFERENCE_COUNT}} and {{PORT_TO_EXPORTS}} maps
>  * Increment the count when a table starts using the server
>  * Decrement the count when a table stops; stop the server only when the 
> count reaches zero
>  * Add utility methods for debugging ({{isServerRunning}}, {{getReferenceCount}})
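The bookkeeping proposed above could look roughly like the following. This is a minimal sketch and not the actual patch: {{FakeServer}}, {{RefCountedServers}}, {{PORT_TO_SERVER}}, {{acquire}} and {{release}} are hypothetical names chosen for illustration, the real HTTP server is replaced by a stand-in object, and {{PORT_TO_EXPORTS}} is left out because the issue does not say what it stores.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the embedded HTTP server; not a Hudi or Prometheus class. */
class FakeServer {
  private boolean running = true;

  void stop() {
    running = false;
  }

  boolean isRunning() {
    return running;
  }
}

/** Sketch of per-port reference counting; every method locks on the class. */
class RefCountedServers {
  // Map name taken from the issue; its exact value type is an assumption.
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();
  // Assumed companion map holding the one shared server per port.
  private static final Map<Integer, FakeServer> PORT_TO_SERVER = new HashMap<>();

  /** Called when a table starts: reuse or create the server, then count the user. */
  static synchronized FakeServer acquire(int port) {
    FakeServer server = PORT_TO_SERVER.computeIfAbsent(port, p -> new FakeServer());
    PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
    return server;
  }

  /** Called when a table stops: the server is stopped only by the last user. */
  static synchronized void release(int port) {
    Integer count = PORT_TO_REFERENCE_COUNT.get(port);
    if (count == null) {
      return; // nothing registered on this port
    }
    if (count > 1) {
      PORT_TO_REFERENCE_COUNT.put(port, count - 1);
      return;
    }
    PORT_TO_REFERENCE_COUNT.remove(port);
    FakeServer server = PORT_TO_SERVER.remove(port);
    if (server != null) {
      server.stop();
    }
  }

  static synchronized int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }

  static synchronized boolean isServerRunning(int port) {
    FakeServer server = PORT_TO_SERVER.get(port);
    return server != null && server.isRunning();
  }
}
```

Guarding both maps with a single lock keeps the count and the server entry consistent with each other, at the cost of serializing reporter start and stop, which are expected to be rare.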
> h2. Future Behavior
>  # Table A starts → Server starts, reference count = 1
>  # Table B starts → Server reused, reference count = 2
>  # *Table A pauses* → Reference count = 1, *server stays alive*
>  # Table B continues → *Metrics continue working*
>  # Table B stops → Reference count = 0, server stops gracefully
> h2. Implementation Details
> The implementation involves modifying the {{PrometheusReporter}} constructor 
> to increment reference counts and register exports in a thread-safe manner. 
> The {{stop()}} method would be updated to decrement reference counts and only 
> perform cleanup when no references remain. We would also add utility methods 
> for debugging and monitoring server status. All changes would be backward 
> compatible, ensuring single-table usage continues to work exactly as before.
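One way the constructor and {{stop()}} changes described above might fit together is sketched below. It is an illustration under assumptions, not the real {{PrometheusReporter}}: the class {{SketchReporter}}, its single-argument constructor and the {{SERVER_STOPS}} counter are hypothetical, and the actual server shutdown is only simulated. The idempotent {{stop()}} guard is an extra assumption that the issue does not mention; it is included because a double stop from one table would otherwise decrement the count twice.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reporter sketch; the real PrometheusReporter takes different arguments. */
class SketchReporter {
  // Map name taken from the issue; using a ConcurrentMap is an assumption.
  private static final ConcurrentMap<Integer, Integer> PORT_TO_REFERENCE_COUNT =
      new ConcurrentHashMap<>();
  // Illustration only: counts how often the (simulated) server was shut down.
  static final AtomicInteger SERVER_STOPS = new AtomicInteger();

  private final int port;
  // Ensures one reporter instance releases its reference at most once.
  private final AtomicBoolean stopped = new AtomicBoolean(false);

  SketchReporter(int port) {
    this.port = port;
    // merge() is atomic per key on ConcurrentHashMap, so concurrent constructors are safe.
    PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
  }

  void stop() {
    if (!stopped.compareAndSet(false, true)) {
      return; // already stopped; do not decrement again
    }
    PORT_TO_REFERENCE_COUNT.compute(port, (p, count) -> {
      if (count == null) {
        return null; // nothing registered
      }
      if (count == 1) {
        SERVER_STOPS.incrementAndGet(); // the real server shutdown would go here
        return null; // returning null removes the mapping
      }
      return count - 1;
    });
  }

  static int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }
}
```

With a single reporter the count goes from 1 to 0 on the first {{stop()}}, so the single-table path behaves as it does today.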
> h2. Backward Compatibility
>  * Single table usage unchanged - works exactly as before
>  * No breaking changes
>  * Reference count goes 1→0 for single table, server stops as before
> h2. Use Cases
>  * Multi-table Spark jobs with pause/resume functionality
>  * Shared Prometheus server across multiple Hudi tables
>  * Better resource management for Prometheus servers
> h2. Testing
>  * {*}Multi-table reference counting{*}: Basic 2-table scenario
>  * {*}Concurrent access{*}: 10 threads creating reporters simultaneously
>  * {*}Port isolation{*}: Different ports work independently
>  * {*}Partial failure{*}: One table fails while others continue
>  * {*}Thread safety{*}: All operations properly synchronized
>  * {*}Backward compatibility{*}: Single table behavior unchanged
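The concurrent-access case listed above could be exercised along these lines. This is a hedged sketch rather than the test from the patch: {{ConcurrentAccessCheck}}, {{COUNTS}}, {{acquire}}, {{release}} and {{acquireConcurrently}} are hypothetical names, and a bare counter map stands in for creating real reporters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical harness: N threads register on the same port at the same moment. */
class ConcurrentAccessCheck {
  // Stand-in for the per-port reference count map.
  static final ConcurrentHashMap<Integer, Integer> COUNTS = new ConcurrentHashMap<>();

  static void acquire(int port) {
    COUNTS.merge(port, 1, Integer::sum);
  }

  static void release(int port) {
    // Returning null from the remapping function removes the entry at zero.
    COUNTS.computeIfPresent(port, (p, c) -> c == 1 ? null : c - 1);
  }

  /** Runs `threads` simultaneous acquires and returns the resulting count. */
  static int acquireConcurrently(int port, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int i = 0; i < threads; i++) {
        futures.add(pool.submit(() -> {
          start.await(); // hold every worker until all are queued
          acquire(port);
          return null;
        }));
      }
      start.countDown(); // release all workers at once
      for (Future<?> f : futures) {
        f.get(10, TimeUnit.SECONDS); // surfaces worker failures and timeouts
      }
      return COUNTS.getOrDefault(port, 0);
    } catch (Exception e) {
      throw new IllegalStateException("concurrent acquire failed", e);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

A lost update would show up as a final count below the number of threads, which is the failure this kind of test is meant to catch.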



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
