[GitHub] [pulsar] merlimat opened a new issue, #15954: PIP-174: Provide new implementation for broker dispatch cache

GitBox Mon, 06 Jun 2022 17:43:08 -0700


merlimat opened a new issue, #15954:
URL: https://github.com/apache/pulsar/issues/15954

## Motivation

The current implementation of the read cache in the Pulsar broker has largely
remained unchanged for a long time, except for a few minor tweaks.

While the implementation is stable and reasonably efficient for typical
workloads, the overhead of managing cache evictions in a broker that is
running many topics can be quite high, both in extra CPU utilization and in
pressure on the JVM garbage collector, which has to track an increased number
of medium-lived objects.

The goal is to provide an alternative implementation that can adapt better
to
a wider variety of operating conditions.

### Current implementation details

The broker cache is implemented as part of the `ManagedLedger` component,
which sits in the Pulsar broker and provides a higher level of abstraction
on top of BookKeeper.

Each topic (and managed-ledger) has its own private cache space. This cache
is implemented as a `ConcurrentSkipListMap` sorted map that maps
`(ledgerId, entryId) -> payload`. The payload is a `ByteBuf` reference that
can be either a slice of a `ByteBuf` that we got when reading from a socket,
or a copied buffer.
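
As a rough illustration of the structure described above, here is a minimal sketch of a per-topic cache backed by a sorted map. It is not the actual `ManagedLedger` code: the class and method names (`TopicEntryCacheSketch`, `Position`, `evictUpTo`) are hypothetical, and a plain `byte[]` stands in for the Netty `ByteBuf` payload so that the example only needs the JDK.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical illustration of a per-topic entry cache; not the real broker code. */
class TopicEntryCacheSketch {
    /** Cache key: the position of an entry, ordered by ledger and then by entry. */
    static final class Position implements Comparable<Position> {
        final long ledgerId;
        final long entryId;

        Position(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }

        @Override
        public int compareTo(Position other) {
            int byLedger = Long.compare(ledgerId, other.ledgerId);
            return byLedger != 0 ? byLedger : Long.compare(entryId, other.entryId);
        }
    }

    // Assumption: byte[] replaces ByteBuf here. Every cached entry still costs
    // at least a key object, a payload object and a skip-list node on the heap.
    private final ConcurrentSkipListMap<Position, byte[]> entries =
            new ConcurrentSkipListMap<>();

    void put(long ledgerId, long entryId, byte[] payload) {
        entries.put(new Position(ledgerId, entryId), payload);
    }

    byte[] get(long ledgerId, long entryId) {
        return entries.get(new Position(ledgerId, entryId));
    }

    /** Removes every entry up to and including the given position; returns the count. */
    int evictUpTo(long ledgerId, long entryId) {
        ConcurrentNavigableMap<Position, byte[]> head =
                entries.headMap(new Position(ledgerId, entryId), true);
        int removed = head.size();
        head.clear();
        return removed;
    }
}
```

The point of the sketch is the per-entry object count: each cached entry keeps several heap objects alive until it is evicted.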

Each topic cache is allowed to use the full broker max cache size before an
eviction is triggered. The total cache size is effectively a resource shared
across all the topics, where a topic can use a larger portion of it if it
"asks for more".

When the eviction happens, we need to do an expensive ranking of all the
caches in the broker and evict from each of them in proportion to the space
it currently uses.
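
To make the proportional split concrete, here is a minimal sketch that only computes how many bytes each topic cache would be asked to free. It leaves out the ranking step and the eviction itself, and the names (`ProportionalEvictionSketch`, `plan`) are hypothetical rather than taken from the broker code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: split an eviction target across caches by current usage. */
class ProportionalEvictionSketch {
    /**
     * Returns, per topic, the number of bytes to free so that the total is
     * roughly bytesToFree. Shares are rounded down, so the sum may be slightly
     * lower than the target.
     */
    static Map<String, Long> plan(Map<String, Long> usedBytesByTopic, long bytesToFree) {
        long totalUsed = 0;
        for (long used : usedBytesByTopic.values()) {
            totalUsed += used;
        }
        Map<String, Long> plan = new LinkedHashMap<>();
        if (totalUsed == 0) {
            return plan; // nothing cached, nothing to evict
        }
        for (Map.Entry<String, Long> e : usedBytesByTopic.entrySet()) {
            // Each cache frees a share proportional to the space it currently uses.
            plan.put(e.getKey(), bytesToFree * e.getValue() / totalUsed);
        }
        return plan;
    }
}
```

For example, with three caches using 600, 300 and 100 bytes and a target of 500 bytes, the plan is 300, 150 and 50 bytes. Even this simplified version has to visit every topic cache on each eviction pass, which is where the cost comes from with many topics.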

The bigger problem is represented by the `ConcurrentSkipListMap` and the
`ByteBuf` objects that need to be tracked. The skip list is essentially a
"tree"-like structure and needs to maintain Java objects for each entry in
the cache. We may also end up with a huge number of `ByteBuf` objects.

A cache workload is typically the worst-case scenario for any garbage
collector implementation because it involves creating objects, storing them
for some amount of time and then throwing them away. During that time, the GC
would have already tenured these objects and copied them into an "old
generation" space, and sometime later, a costly compaction of that memory
would have to be performed.

To mitigate the effect of the cache workload on the GC, we're being very
aggressive in
purging the cache by triggering time-based eviction. By putting a max TTL on
the elements in
the cache, we can avoid keeping the objects around for too long to be a
problem for the GC.

The downside is that we're artificially limiting the cache to a very short
time frame, which reduces its usefulness.

The other problem is the CPU cost involved in doing these frequent
evictions, which can be very high when there are tens of thousands of topics
in a broker.

## Proposed changes

Instead of dealing with individual caches for each topic, let's adopt a
model where
there is a single cache space for the broker.

This cache is broken into N segments which act as a circular buffer.
Whenever a segment is full, we start writing into the next one, and when we
reach the last one, we wrap around and recycle the first segment.

Each segment is composed of a buffer, an offset, and a hashmap which maps
`(ledgerId, entryId) -> offset`.

This model has been working very well for the BookKeeper `ReadCache`:

https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/ReadCache.java
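
The segment layout described above can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the proposed implementation: the class name `SegmentedCacheSketch` is hypothetical, each entry is stored with a 4-byte length prefix (an assumption, so that the index only needs to hold an offset), and a boxed `HashMap` is used for brevity where a real implementation would more likely use a primitive, allocation-free map.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, single-threaded sketch of a cache made of rotating segments. */
class SegmentedCacheSketch {
    private static final class Segment {
        final ByteBuffer buffer;                                 // entry bytes, in direct memory
        final Map<List<Long>, Integer> index = new HashMap<>();  // (ledgerId, entryId) -> offset
        int writeOffset = 0;

        Segment(int size) {
            buffer = ByteBuffer.allocateDirect(size);
        }

        void reset() {
            index.clear();
            writeOffset = 0;
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private int current = 0;

    SegmentedCacheSketch(int segmentCount, int segmentSize) {
        for (int i = 0; i < segmentCount; i++) {
            segments.add(new Segment(segmentSize));
        }
    }

    private static List<Long> key(long ledgerId, long entryId) {
        return Arrays.asList(ledgerId, entryId);
    }

    void put(long ledgerId, long entryId, byte[] payload) {
        int needed = Integer.BYTES + payload.length;   // length prefix + data
        Segment s = segments.get(current);
        if (needed > s.buffer.capacity()) {
            return;                                    // too large to cache at all
        }
        if (s.writeOffset + needed > s.buffer.capacity()) {
            // Current segment is full: move to the next one and recycle it.
            current = (current + 1) % segments.size();
            s = segments.get(current);
            s.reset();
        }
        ByteBuffer view = s.buffer.duplicate();        // independent position, same memory
        view.position(s.writeOffset);
        view.putInt(payload.length);
        view.put(payload);
        s.index.put(key(ledgerId, entryId), s.writeOffset);
        s.writeOffset += needed;
    }

    /** Returns a copy of the cached entry, or null on a cache miss. */
    byte[] get(long ledgerId, long entryId) {
        List<Long> k = key(ledgerId, entryId);
        for (Segment s : segments) {
            Integer offset = s.index.get(k);
            if (offset != null) {
                ByteBuffer view = s.buffer.duplicate();
                view.position(offset);
                byte[] out = new byte[view.getInt()];
                view.get(out);
                return out;
            }
        }
        return null;
    }
}
```

In this sketch, eviction is just the `reset()` call made when the writer wraps around to a segment that is already in use; there is no per-topic accounting.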

There are 2 main advantages to this approach:

1. Entries are copied into the cache buffer (in direct memory), so we don't
need to keep any long-lived Java objects around.
2. Eviction becomes a completely trivial operation: buffers are just rotated
and overwritten. We don't need to do any per-topic task or keep track of any
utilization.

### API changes

No user-facing API changes are required.

### New configuration options

The existing cache implementation will not be removed at this point. Users
will be able to configure the old implementation in `broker.conf`.

This option will be useful in case performance regressions are seen for
some use cases with the new cache implementation.

