[ https://issues.apache.org/jira/browse/HUDI-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-9793:
---------------------------------
Labels: metrics multi-table prometheus pull-request-available (was:
metrics multi-table prometheus)
> Improve PrometheusReporter to support multi-table scenarios with reference
> counting
> -----------------------------------------------------------------------------------
>
> Key: HUDI-9793
> URL: https://issues.apache.org/jira/browse/HUDI-9793
> Project: Apache Hudi
> Issue Type: Improvement
> Components: metrics
> Reporter: Shubham Patel
> Priority: Major
> Labels: metrics, multi-table, prometheus, pull-request-available
>
> h2. Summary
> Currently, {{PrometheusReporter}} stops the HTTP server as soon as any table
> stops, which breaks metrics for the other tables in multi-table scenarios.
> This enhancement adds reference counting so that the server is stopped only
> when no table is still using it.
> h2. Current Behavior
> Multiple tables can share the same Prometheus server at startup (thanks
> to HUDI-7083), but when any one of them stops, the entire server is stopped
> and removed from the static maps. The remaining tables then lose access to
> their metrics and cannot emit new ones. This works fine for
> single-table-per-job scenarios, but it breaks multi-table use cases.
> h2. Problem
> Users running multiple Hudi tables in a single Spark job with pause/resume
> functionality experience broken metrics when individual tables are paused or
> stopped.
> h2. Current Behavior (Problematic Flow)
> # Table A starts → Creates Prometheus server on port 9091
> # Table B starts → Reuses same server (works thanks to HUDI-7083)
> # *Table A pauses* → Calls {{PrometheusReporter.stop()}} → *Stops entire
> server*
> # Table B continues → No server available → *Metrics broken*
> h2. Proposed Enhancement
> Add reference counting to track how many tables are using each server port:
> * Add {{PORT_TO_REFERENCE_COUNT}} and {{PORT_TO_EXPORTS}} maps
> * Increment the count when a table starts using the server
> * Decrement the count when a table stops; stop the server only when the count reaches zero
> * Add utility methods for debugging ({{isServerRunning}}, {{getReferenceCount}})
> h2. Future Behavior
> # Table A starts → Server starts, reference count = 1
> # Table B starts → Server reused, reference count = 2
> # *Table A pauses* → Reference count = 1, *server stays alive*
> # Table B continues → *Metrics continue working*
> # Table B stops → Reference count = 0, server stops gracefully
> h2. Implementation Details
> The implementation involves modifying the {{PrometheusReporter}} constructor
> to increment reference counts and register exports in a thread-safe manner.
> The {{stop()}} method would be updated to decrement the reference count and
> to perform cleanup only when no references remain. We would also add utility
> methods for debugging and monitoring server status. All changes would be
> backward compatible, so single-table usage continues to work exactly as before.
> h2. Backward Compatibility
> * Single-table usage is unchanged and works exactly as before
> * No breaking changes
> * For a single table, the reference count goes 1→0 and the server stops as before
> h2. Use Cases
> * Multi-table Spark jobs with pause/resume functionality
> * Shared Prometheus server across multiple Hudi tables
> * Better resource management for Prometheus servers
> h2. Testing
> * *Multi-table reference counting*: Basic 2-table scenario
> * *Concurrent access*: 10 threads creating reporters simultaneously
> * *Port isolation*: Different ports work independently
> * *Partial failure*: One table fails while others continue
> * *Thread safety*: All operations properly synchronized
> * *Backward compatibility*: Single table behavior unchanged
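
The per-port reference counting proposed in the issue description above could look roughly like the following. This is a minimal sketch, not the actual Hudi change: {{StubServer}}, {{PORT_TO_SERVER}}, {{acquire}} and {{release}} are hypothetical names standing in for the real HTTP server and for the {{PrometheusReporter}} constructor and {{stop()}} paths. Only {{PORT_TO_REFERENCE_COUNT}}, {{isServerRunning}} and {{getReferenceCount}} are taken from the description. The {{PORT_TO_EXPORTS}} map is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the shared Prometheus HTTP server (not Hudi code). */
final class StubServer {
  final int port;
  volatile boolean running = true;

  StubServer(int port) {
    this.port = port;
  }

  void close() {
    running = false;
  }
}

/** Sketch of a registry that keeps one server per port alive while it has users. */
final class RefCountedServers {
  // Hypothetical map name; the issue only says servers are held in static maps.
  private static final Map<Integer, StubServer> PORT_TO_SERVER = new HashMap<>();
  // Map name taken from the issue description.
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();

  private RefCountedServers() {
  }

  /** Assumed to be called when a table's reporter is constructed. */
  static synchronized StubServer acquire(int port) {
    StubServer server = PORT_TO_SERVER.computeIfAbsent(port, StubServer::new);
    PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
    return server;
  }

  /** Assumed to be called from stop(); closes the server only for the last user. */
  static synchronized void release(int port) {
    Integer count = PORT_TO_REFERENCE_COUNT.get(port);
    if (count == null) {
      return; // unbalanced release: nothing is registered on this port
    }
    if (count > 1) {
      PORT_TO_REFERENCE_COUNT.put(port, count - 1);
    } else {
      PORT_TO_REFERENCE_COUNT.remove(port);
      PORT_TO_SERVER.remove(port).close();
    }
  }

  static synchronized int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }

  static synchronized boolean isServerRunning(int port) {
    return PORT_TO_SERVER.containsKey(port);
  }
}
```

With this sketch, two calls to {{acquire(9091)}} return the same server and bring the count to 2. The first {{release(9091)}} leaves the server running with a count of 1, and the second one closes it. All methods synchronize on the class, which keeps the two maps consistent with each other at the cost of serializing start and stop across ports.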
--
This message was sent by Atlassian Jira
(v8.20.10#820010)