[
https://issues.apache.org/jira/browse/HDDS-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siyao Meng updated HDDS-15335:
------------------------------
Description:
NSSummaryTask.process() handles every batch of OM update events that
Recon ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket
layout) it has two avoidable costs: every event triggers a fresh
getBucketTable().getSkipCache(...) RocksDB point read even though a
bucket's layout and objectID never change; and the three sub-tasks
(FSO / Legacy / OBS) iterate the event list sequentially even though
they operate on disjoint slices and write to disjoint NSSummary
entries.
This patch makes three changes:
1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a
   field-level Map. After the first lookup for a bucket, subsequent
   lookups become HashMap.get() calls.
2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread
   pool and joins on all three. The threads see the same event list;
   each only processes events whose (table, bucket layout) matches
   its target. Target NSSummary entries are disjoint across
   sub-tasks so no cross-thread synchronization is needed, and the
   TaskResult contract is unchanged.
3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo)
   call: the parent of an OBS key is its bucket, and an UPDATE event
   cannot move a key between buckets.
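For illustration only, the caching idea in change 1 can be sketched as
a small memoizing wrapper. This is not the patch code: BucketFacts,
BucketFactsCache, the loader function and loadCount() are hypothetical
names, and the loader stands in for the RocksDB point read. The sketch
also assumes each cache instance is used by a single thread, which is
why a plain HashMap is enough here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the immutable bucket facts a handler needs. */
final class BucketFacts {
  final long objectId;
  final String layout;

  BucketFacts(long objectId, String layout) {
    this.objectId = objectId;
    this.layout = layout;
  }
}

/** Memoizes bucket lookups so the backing store is read once per bucket. */
final class BucketFactsCache {
  private final Map<String, BucketFacts> cache = new HashMap<>();
  private final Function<String, BucketFacts> loader; // stand-in for the DB read
  private int loads; // number of times the loader was actually called

  BucketFactsCache(Function<String, BucketFacts> loader) {
    this.loader = loader;
  }

  /** Returns the cached entry, calling the loader only on a miss. */
  BucketFacts get(String bucketDbKey) {
    BucketFacts hit = cache.get(bucketDbKey);
    if (hit != null) {
      return hit;
    }
    loads++;
    BucketFacts loaded = loader.apply(bucketDbKey);
    if (loaded != null) {
      // Missing buckets are not cached, so a later lookup can still find them.
      cache.put(bucketDbKey, loaded);
    }
    return loaded;
  }

  int loadCount() {
    return loads;
  }
}
```

With a cache like this, 1000 events on the same bucket trigger one
loader call instead of 1000.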
Throughput on Intel Xeon Silver 4416+, 80 CPUs, OpenJDK 17, at 500k
events plus 500k preloaded keys, RATIS replication, mixed 60/30/10
create/update/delete:
| Code               | events/sec | vs vanilla |
| ------------------ | ----------:| ----------:|
| Vanilla            |     78,098 |      1.00x |
| + change 1 (cache) |    672,172 |      8.61x |
| + changes 1 and 2  |    918,550 |     11.76x |
Change 1 is the dominant lever: it removes about 1.5M
getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task
scans of 500k events, each scan doing one or more bucket lookups
before bailing or processing). Change 2 gives a further ~1.37x via JIT
specialization and instruction-cache locality on per-thread hot loops.
Change 3 is below measurement noise.
Heap pressure is reduced because change 1 stops allocating a transient
OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an
8 GB heap, total stop-the-world pause dropped 25% (1137 ms to 850 ms)
and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across
the bench lifetime.
On a 100% FSO workload (fileTable / dirTable / deletedDirTable),
change 1 is a no-op because the FSO sub-task reads
keyInfo.getParentObjectID() directly without a bucket lookup. Change 2
still takes the Legacy and OBS scans off the critical path: each of
them walks the event list only to skip every event at the table-name
check. That cost is small relative to FSO's own processing, so the
wall-clock speedup on FSO-heavy workloads is correspondingly smaller.
The patch does not regress throughput in either case.
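As a rough sketch of the scan pattern in change 2 (not the patch
itself), each sub-task can be modelled as a filter that walks the same
shared event list on its own thread and handles only the events that
match its target. Event, ParallelScan, defaultFilters() and
countMatches() are hypothetical names, and the layout strings are plain
placeholders; the real sub-tasks update NSSummary entries rather than
count matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical event: the table it came from and its bucket layout. */
final class Event {
  final String table;
  final String layout;

  Event(String table, String layout) {
    this.table = table;
    this.layout = layout;
  }
}

final class ParallelScan {
  /** Filters for FSO, Legacy and OBS, in that order. */
  static List<Predicate<Event>> defaultFilters() {
    Predicate<Event> fso = e -> e.table.equals("fileTable")
        || e.table.equals("dirTable") || e.table.equals("deletedDirTable");
    Predicate<Event> legacy =
        e -> e.table.equals("keyTable") && e.layout.equals("LEGACY");
    Predicate<Event> obs =
        e -> e.table.equals("keyTable") && e.layout.equals("OBJECT_STORE");
    return List.of(fso, legacy, obs);
  }

  /** Runs one full scan per filter in parallel; needs at least one filter. */
  static List<Integer> countMatches(List<Event> events,
                                    List<Predicate<Event>> filters) {
    ExecutorService pool = Executors.newFixedThreadPool(filters.size());
    try {
      List<Callable<Integer>> jobs = new ArrayList<>();
      for (Predicate<Event> filter : filters) {
        jobs.add(() -> {
          int matched = 0;
          for (Event e : events) {   // every job sees the whole list
            if (filter.test(e)) {    // non-matching events are skipped
              matched++;
            }
          }
          return matched;
        });
      }
      List<Integer> results = new ArrayList<>();
      // invokeAll blocks until all jobs finish and keeps submission order.
      for (Future<Integer> f : pool.invokeAll(jobs)) {
        results.add(f.get());
      }
      return results;
    } catch (InterruptedException ex) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(ex);
    } catch (ExecutionException ex) {
      throw new IllegalStateException(ex);
    } finally {
      pool.shutdown();
    }
  }
}
```

Because every job only reads the shared list and writes to its own
local counter, no locking is needed in this sketch. That mirrors the
disjointness argument above but does not prove it for the real code.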
The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is
provided as a companion patch on this JIRA.
All 81 existing TestNSSummaryTask* unit tests pass.
was:
NSSummaryTask is a ReconOmTask that the dispatcher fans out on every batch of
OM RocksDB updates Recon ingests.
Inside its process() method, three sub-tasks (FSO / Legacy / OBS) ran
sequentially even though they operate on disjoint slices of the event stream
(filtered by table and bucket layout) and write to disjoint NSSummary entries.
The Legacy and OBS sub-tasks were also each individually slower than necessary
because every event triggered a fresh RocksDB point read of the corresponding
OmBucketInfo from Recon's local OM snapshot DB (via
{{getBucketTable().getSkipCache(...)}}), even though bucket layout and objectID
never change once a bucket exists.
Changes proposed:
1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a
   field-level Map keyed by the bucket DB key. Bucket layout/objectID
   is immutable for an existing bucket, so an unbounded cache is safe;
   cluster bucket count is bounded so memory is not a concern. After
   the first event for a given bucket, the cost drops from a RocksDB
   point read to a HashMap.get().
2. NSSummaryTask.process() submits each of the three sub-tasks to its
   own thread in a 3-thread pool and joins on all three. The threads
   do not partition events: all three see the same event list and
   each independently iterates it, processing only the events whose
   (table, bucket layout) matches its target:
   - FSO thread: events on fileTable / dirTable / deletedDirTable.
   - Legacy thread: keyTable events whose bucket layout is LEGACY.
   - OBS thread: keyTable events whose bucket layout is OBJECT_STORE.
   Events that don't match a thread's target are skipped (table-name
   check, or bucket-layout check after a now-cached bucket lookup
   from change 1). Each sub-task already maintains its own per-call
   NSSummary accumulation map and writes to ReconNamespaceSummaryManager
   only at flush time via an atomic RDBBatchOperation; the target
   NSSummary entries are disjoint between FSO and Legacy/OBS (FSO has
   its own namespace tree) and between Legacy and OBS (a bucket has
   exactly one layout), so no synchronization is needed across
   threads. Per-sub-task seek positions and per-task failure flags
   are preserved, so the TaskResult contract is the same as before.
3. In the OBS UPDATE path, drop the redundant getKeyParentID(oldKeyInfo)
   call. The parent of an OBS key is the bucket, and a key cannot move
   between buckets via an UPDATE event (that would be a DELETE+PUT), so
   the parent objectID computed for the new key value is identical to
   the parent objectID for the old key value.
> Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups
> -------------------------------------------------------------------------
>
> Key: HDDS-15335
> URL: https://issues.apache.org/jira/browse/HDDS-15335
> Project: Apache Ozone
> Issue Type: Improvement
> Components: Ozone Recon
> Reporter: Siyao Meng
> Assignee: Siyao Meng
> Priority: Major
>