turboFei opened a new pull request, #16693:
URL: https://github.com/apache/iceberg/pull/16693
## Problem
`RESTMetricsReporter.report()` uses
`Tasks.range(1).executeWith(METRICS_EXECUTOR).run()` to submit the HTTP metrics
call to a single-background-thread executor. However, after submitting the
task, the **calling thread is blocked** inside `Tasks.waitFor()` — a
`sleep(10ms)` poll loop that spins until the submitted future is done:
```java
// Tasks.waitFor() — called by runParallel() after submitting to
METRICS_EXECUTOR
while (true) {
if (allDone(futures)) { ... return; }
Thread.sleep(10); // ← calling thread is blocked here
}
```
This makes `report()` **synchronous** from the caller's perspective, despite
appearing to use an executor.
### Thread pile-up in practice
The intended use of this API is async, best-effort metrics reporting. Any
caller that wraps `report()` in its own thread pool to make it non-blocking
will encounter a severe thread leak:
```
caller pool thread
→ report()
→ Tasks.range(1).executeWith(METRICS_EXECUTOR).run()
→ submits HTTP call to METRICS_EXECUTOR (1 thread)
→ blocks in Tasks.waitFor(), waiting for METRICS_EXECUTOR
↑
caller thread is STUCK here until HTTP completes
```
If `report()` is called faster than `METRICS_EXECUTOR` can process (e.g.
many concurrent Iceberg scans/commits), the caller pool keeps growing while its
threads pile up in `waitFor()`. In production we observed **9,351
`Iceberg-CompactionReport-Thread` threads** accumulating at ~2.7 threads/second
over several hours, consuming several GB of thread stack memory.
## Fix
Replace `Tasks.range(1).executeWith(METRICS_EXECUTOR).run()` with a direct
`METRICS_EXECUTOR.execute(lambda)`:
```java
METRICS_EXECUTOR.execute(() -> {
try {
client.post(...);
} catch (Exception e) {
LOG.warn("Failed to report metrics ...", e);
}
});
```
`report()` now enqueues the HTTP call and **returns immediately**. The
background thread processes reports serially; callers are never blocked.
Switch `METRICS_EXECUTOR` from `ThreadPools.newExitingWorkerPool` (which
uses `LinkedBlockingQueue` internally but registers a JVM shutdown hook) to a
plain `ThreadPoolExecutor` with daemon threads and an unbounded
`LinkedBlockingQueue`. Daemon threads exit automatically when the JVM shuts
down, making the shutdown hook unnecessary.
## Testing
`TestRESTMetricsReporter` verifies three things:
1. **`reportIsNonBlocking`** — `report()` returns in < 2 s even when the
HTTP client blocks for 5 s
2. **`reportSendsHttpPost`** — the HTTP POST is eventually delivered to the
client
3. **`reportNullIsIgnored`** — a null report does not trigger any HTTP call
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]