George Jahad created HDDS-7935:
----------------------------------
Summary: LRU Cache entries may get evicted/closed during long
running processes
Key: HDDS-7935
URL: https://issues.apache.org/jira/browse/HDDS-7935
Project: Apache Ozone
Issue Type: Sub-task
Reporter: George Jahad
In the current snapshot LRU cache implementation, evicting the oldest snapshot
also closes the corresponding RocksDB instance:
https://github.com/apache/ozone/blob/3f7ded2a34c0c35b89901e222ceaee0d1fdf08b6/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java#L124
That is probably fine for short-lived tasks like users reading snapshots, but it
is probably not safe for long-lived tasks like snapdiff and maybe snapshot delete.
The problem is that a cache entry's recency is currently refreshed only when the
snapshot is initially retrieved from the cache; subsequent reads from the
snapshot itself don't refresh it. Thus it is possible for RocksDB instances to
be evicted and closed in the middle of snapdiff processing.
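To make the failure mode concrete, here is a minimal sketch of the hazard. It
uses java.util.LinkedHashMap in access order as a stand-in for the snapshot
cache, not the cache that OmSnapshotManager actually uses, and FakeSnapshotDb
and ClosingLruCache are hypothetical names invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a snapshot's RocksDB handle. */
class FakeSnapshotDb implements AutoCloseable {
  private final String name;
  private boolean closed;

  FakeSnapshotDb(String name) {
    this.name = name;
  }

  /** Reads go straight to the handle, so the cache never sees them. */
  String read(String key) {
    if (closed) {
      throw new IllegalStateException(name + " is closed");
    }
    return name + ":" + key;
  }

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}

/** Access-ordered LRU map that closes a handle as soon as it is evicted. */
class ClosingLruCache extends LinkedHashMap<String, FakeSnapshotDb> {
  private final int capacity;

  ClosingLruCache(int capacity) {
    super(16, 0.75f, true); // true = order entries by most recent access
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, FakeSnapshotDb> eldest) {
    if (size() > capacity) {
      eldest.getValue().close(); // closed even if a task still holds the handle
      return true;
    }
    return false;
  }
}
```

With a capacity of 2, a task that fetches "snap1" once and keeps reading from
the returned handle will find it closed after two other snapshots are put into
the cache, because only get() and put() on the map update the recency order.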
One alternative I can think of is to add some kind of reference counting scheme
so that RocksDB instances aren't closed automatically on eviction.
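A rough sketch of what the reference counting idea could look like follows.
RefCountedSnapshot and its methods are hypothetical, and the Runnable stands in
for whatever actually closes the RocksDB instance. The eviction hook would call
markEvicted() instead of closing directly, and the close is deferred until the
last holder releases.

```java
/** Hypothetical wrapper: the handle closes only when evicted AND unreferenced. */
class RefCountedSnapshot {
  private final Runnable closeAction; // stand-in for closing the RocksDB instance
  private int refCount;
  private boolean evicted;
  private boolean closed;

  RefCountedSnapshot(Runnable closeAction) {
    this.closeAction = closeAction;
  }

  /** Called by a task before it starts using the snapshot. */
  synchronized RefCountedSnapshot acquire() {
    if (closed) {
      throw new IllegalStateException("snapshot already closed");
    }
    refCount++;
    return this;
  }

  /** Called by a task when it is done; may trigger the deferred close. */
  synchronized void release() {
    if (refCount == 0) {
      throw new IllegalStateException("release without acquire");
    }
    refCount--;
    closeIfUnused();
  }

  /** Called by the cache's eviction hook instead of closing directly. */
  synchronized void markEvicted() {
    evicted = true;
    closeIfUnused();
  }

  synchronized boolean isClosed() {
    return closed;
  }

  private void closeIfUnused() {
    if (evicted && refCount == 0 && !closed) {
      closed = true;
      closeAction.run();
    }
  }
}
```

A long-running task would pair acquire() with release() in a try/finally block.
One open question this sketch does not answer is what a caller should do when
acquire() fails because the entry was already evicted and closed; presumably it
would have to go back to the cache and reload the snapshot.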
Another possibility is to have an entirely separate pool of snapshot entries,
outside of the cache, that are explicitly opened and closed by long-running
tasks like snapdiff.
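The separate pool could be sketched roughly as below. LongRunningSnapshotPool
and Handle are hypothetical names, and the comments mark where a real
implementation would open and close the RocksDB instance. The key property is
that the pool has no eviction policy, so an entry lives until its owner closes
it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical pool with no eviction: entries live until their owner closes them. */
class LongRunningSnapshotPool {
  private final Set<Handle> openHandles = ConcurrentHashMap.newKeySet();

  /** Handle owned by a single long-running task. */
  final class Handle implements AutoCloseable {
    private final String snapshotName;
    private final AtomicBoolean closed = new AtomicBoolean();

    private Handle(String snapshotName) {
      this.snapshotName = snapshotName;
    }

    String snapshotName() {
      return snapshotName;
    }

    boolean isClosed() {
      return closed.get();
    }

    @Override
    public void close() {
      // compareAndSet makes close() idempotent.
      if (closed.compareAndSet(false, true)) {
        // A real implementation would close the underlying RocksDB instance here.
        openHandles.remove(this);
      }
    }
  }

  /** Opens a dedicated handle that the LRU cache knows nothing about. */
  Handle open(String snapshotName) {
    // A real implementation would open the snapshot's RocksDB instance here.
    Handle handle = new Handle(snapshotName);
    openHandles.add(handle);
    return handle;
  }

  int openCount() {
    return openHandles.size();
  }
}
```

A task like snapdiff would wrap its work in try-with-resources, for example
try (LongRunningSnapshotPool.Handle h = pool.open("snap1")) { ... }, so the
handle is closed when the task finishes or fails. The trade-off compared with
reference counting is that the same snapshot may be open both in the cache and
in the pool at the same time.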