Hi Anand,

Thanks for sharing this information!

I did not dig deep into it, but just from a quick glance, I believe
synchronous execution on Vert.x threads is a concern. Normally, Polaris
requests should be handled on the "executor" threads.

I'll make a second pass at this later (hopefully not too late :)

Cheers,
Dmitri.

On Tue, May 5, 2026 at 9:33 AM Anand Kumar Sankaran via dev <
[email protected]> wrote:

> Hi all,
>
> I picked up 1.4.1 and turned on table metrics persistence.
>
> This time around, I only wanted to persist the CommitReport (much
> lower volume than the ScanReport). I created a CompositeMetricsReporter
> that fans out to three delegates: the logging reporter that logs table
> metrics as in 1.3.0, a Prometheus reporter for the metrics we send to
> Prometheus, and a persisting reporter that persists just the
> CommitReport:
>
>
> public CompositeMetricsReporter(
>     @Identifier("default") final PolarisMetricsReporter logging,
>     @Identifier("persisting") final PolarisMetricsReporter persisting,
>     @Identifier("prometheus") final PolarisMetricsReporter prometheus,
>     final ThreadContext threadContext,
>     final MeterRegistry meterRegistry) {
>   // ... body elided ...
> }
>
> CommitReport persistence (the PolarisMetricsReporter "persisting"
> delegate, which writes commit history to the metastore) was executing
> synchronously on the Vert.x worker thread that handled each commit
> request. When Aurora Serverless v2 entered a cold-start/scaling phase,
> each JDBC call took seconds instead of milliseconds. With enough
> concurrent commits, Vert.x's worker thread pool saturated and new
> requests began failing with HTTP 503s.
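>
> For context, here is roughly what the synchronous fan-out looked like.
> This is a simplified sketch, not the exact code: I am using Iceberg's
> MetricsReporter interface as a stand-in for the Polaris reporter type,
> and the class name is made up.
>
> import org.apache.iceberg.metrics.MetricsReport;
> import org.apache.iceberg.metrics.MetricsReporter;
>
> // Sketch: the original fan-out, calling every delegate inline.
> class SyncCompositeReporter implements MetricsReporter {
>   private final MetricsReporter logging;
>   private final MetricsReporter persisting;
>   private final MetricsReporter prometheus;
>
>   SyncCompositeReporter(
>       MetricsReporter logging, MetricsReporter persisting, MetricsReporter prometheus) {
>     this.logging = logging;
>     this.persisting = persisting;
>     this.prometheus = prometheus;
>   }
>
>   @Override
>   public void report(MetricsReport report) {
>     logging.report(report);
>     prometheus.report(report);
>     // The problem: this is a JDBC write, executed on the Vert.x worker
>     // thread that is handling the commit request.
>     persisting.report(report);
>   }
> }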
>
> To fix this, I had to do the following:
>
> Async, bounded persistence executor in CompositeMetricsReporter.
>
> CommitReport persistence is now dispatched onto a dedicated
> ThreadPoolExecutor (4 threads, queue capacity 1024) rather than the
> calling thread. Tasks are wrapped via ThreadContext (MicroProfile
> Context Propagation) so request-scoped CDI beans (CallContext,
> PolarisPrincipal, RequestIdSupplier) remain accessible on the executor
> thread. When the queue is full, the task is dropped (rather than letting
> the queue grow without bound) and counted via a new metric,
> catalog_metrics_persistence_dropped. Queue depth is exposed as
> catalog_metrics_persistence_queue_size. These two metrics give me a
> window into how to tune this further.
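>
> In code, the dispatch path is roughly the following. Again a simplified
> sketch (the class name is illustrative, not what we shipped); the metric
> names, pool size, and queue capacity match what I described above.
>
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> import org.eclipse.microprofile.context.ThreadContext;
>
> import io.micrometer.core.instrument.Counter;
> import io.micrometer.core.instrument.Gauge;
> import io.micrometer.core.instrument.MeterRegistry;
>
> class BoundedPersistenceExecutor {
>   private final ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);
>   private final ThreadPoolExecutor pool;
>   private final Counter dropped;
>   private final ThreadContext threadContext;
>
>   BoundedPersistenceExecutor(ThreadContext threadContext, MeterRegistry registry) {
>     this.threadContext = threadContext;
>     // Dropped tasks are counted, not silently lost.
>     this.dropped =
>         Counter.builder("catalog_metrics_persistence_dropped").register(registry);
>     // Live queue depth, for tuning pool size and capacity later.
>     Gauge.builder("catalog_metrics_persistence_queue_size", queue, q -> q.size())
>         .register(registry);
>     // 4 fixed threads over a bounded queue; when the queue is full the
>     // rejection handler drops the task instead of blocking the caller.
>     this.pool =
>         new ThreadPoolExecutor(
>             4, 4, 0L, TimeUnit.MILLISECONDS, queue, (task, executor) -> dropped.increment());
>   }
>
>   void submit(Runnable persistTask) {
>     // contextualRunnable() captures the request-scoped CDI context so that
>     // CallContext, PolarisPrincipal, and RequestIdSupplier still resolve
>     // on the pool thread.
>     pool.execute(threadContext.contextualRunnable(persistTask));
>   }
> }
>
> Dropping under pressure is deliberate: losing a commit-history row is
> cheaper than letting a slow metastore back up into the request path.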
>
> ScanReport persistence was also disabled. Scan reports represent
> read-path volume and were causing write amplification under read-heavy
> workloads; they are now only passed to the logging and Prometheus
> delegates.
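>
> Putting the two changes together, the composite's report() now looks
> roughly like this (same caveats as above: Iceberg's MetricsReport and
> CommitReport types as stand-ins, and the BoundedPersistenceExecutor
> from the earlier sketch):
>
> @Override
> public void report(MetricsReport report) {
>   // Every report still reaches the logging and Prometheus delegates.
>   logging.report(report);
>   prometheus.report(report);
>   // Only CommitReports are persisted, and only via the bounded executor;
>   // ScanReports never touch the metastore.
>   if (report instanceof CommitReport) {
>     persistenceExecutor.submit(() -> persisting.report(report));
>   }
> }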
>
> I also had to bump up the minimum capacity (ACUs) of the Aurora
> Serverless v2 instance.
>
> When we were implementing the PR, we had discussed a separate data
> source for table metrics but abandoned the idea. I request that we look
> at that again now. Anyone using the "persisting" table metrics
> identifier needs to be careful.
>
> -
> Anand
>
