[ 
https://issues.apache.org/jira/browse/HDDS-8765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duong updated HDDS-8765:
------------------------
    Description: 
h2. Problem

Today, OM manages resource (volume, bucket) locks through LockManager, a 
component that wraps a lock table (a ConcurrentHashMap) holding all active 
locks. Locks are dynamically allocated and destroyed in the lock table based 
on runtime needs. This means that, for every allocated lock, a usage count is 
kept up to date to decide when the lock is no longer referenced. 

The current performance of LockManager is limited by the cost of tracking 
individual lock liveness, i.e., counting the concurrent usages of each lock 
and removing it from the lock table when it is no longer used.

This cost comes mainly from the need to *synchronize* all concurrent access 
to every lock (or technically, to a ConcurrentHashMap section):
 # When acquiring the lock: create a lock object if it does not exist in the 
table and increase the lock usage count.
 # When releasing the lock: decrease the usage count and remove the lock when 
the usage count reaches 0.

!Screenshot 2023-06-05 at 4.31.14 PM.png|width=764,height=243!

This synchronization happens inside two ConcurrentHashMap methods: 
_compute_ and {_}computeIfPresent{_}.

This synchronization creates a bottleneck when multiple threads try to obtain 
and release the same lock, even for read locks.
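
The pattern described above can be sketched roughly as follows. This is a 
minimal illustration and not the actual LockManager code: the class name 
RefCountedLockTable, the CountedLock holder and the method names are 
hypothetical, and only read locks are shown. The point is that every acquire 
and every release goes through _compute_ or {_}computeIfPresent{_}, which 
serialize callers working on the same key even when they only want a read lock.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of a reference-counted lock table (not the Ozone code). */
class RefCountedLockTable {

  /** A lock plus the number of callers currently referencing it. */
  static final class CountedLock {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Only read or written inside compute()/computeIfPresent(), which
    // run atomically per key, so no extra synchronization is needed here.
    int usageCount;
  }

  private final ConcurrentHashMap<String, CountedLock> table =
      new ConcurrentHashMap<>();

  /** Step 1: create the lock if absent, bump its usage count, then lock. */
  void acquireRead(String resource) {
    CountedLock entry = table.compute(resource, (key, existing) -> {
      CountedLock e = (existing == null) ? new CountedLock() : existing;
      e.usageCount++;
      return e;
    });
    // The (possibly blocking) lock call happens outside the map operation.
    entry.lock.readLock().lock();
  }

  /** Step 2: unlock, drop the usage count, remove the entry at zero. */
  void releaseRead(String resource) {
    table.computeIfPresent(resource, (key, existing) -> {
      existing.lock.readLock().unlock();
      // Returning null removes the mapping from the table.
      return --existing.usageCount == 0 ? null : existing;
    });
  }

  /** Number of lock objects currently held in the table. */
  int activeLocks() {
    return table.size();
  }
}
{code}
With this shape, 100 reader threads on the same bucket all funnel through the 
same map entry twice per lock cycle, which matches the contention described 
above.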
h2. Experiment

I ran an experiment of pure OM key reads in the same bucket with 100 reader 
threads. The freon command looks like the following:
{code:bash}
ozone freon ockrw -v duong -b obs --contiguous -n 50000000 -p zerobytes \
  --percentage-read=100 -r 5000000 -s 0 --size 0 -m -t 100
{code}
With the current code, total OPPS tops out at ~100K and getKeyInfo latency is 
~800μs. The time taken to acquire and release a lock is ~40μs. Note that for 
each getKeyInfo request, OM acquires and releases the volume and bucket locks 
multiple times. 

With a [quick and naive 
change|https://github.com/duongkame/ozone/commit/91c2729cef1649f038b1560d260d41f91de4c0b2]
 to remove the synchronization when getting and releasing the locks, the 
getKeyInfo latency drops to ~400μs and total OPPS rises to ~160K. The time to 
acquire and release a lock drops to 4-5μs. Please note that this only 
demonstrates the impact of the synchronization and is not a practical change, 
as the synchronization is essential to the dynamic lock management. 
h2. Proposed solution

 

  was:
h2. Problem

Today, OM manages resource (volume, bucket) locks by LockManager, a component 
enclosing a lock table as a ConcurrentHashMap to store all active locks. Locks 
are dynamically allocated and destroyed in the lock table based on runtime 
needs. This means, for every lock allocated, a usage count is kept up to date 
to decide when the lock is no longer referenced. 

The current performance of LockManager is limited by the cost of maintaining 
individual lock liveness, aka, counting how many concurrent usages to a lock 
and removing it from the lock table when it's no longer used.

This cost mainly incurs from the need to *synchronize* all the concurrent 
access to every lock (or technically, a ConcurrentHashMap section) when:
 # When getting the lock to obtain: create a lock object if it's not existing 
in the table and increase the lock usage count.
 # When releasing the lock: decrease the usage count and remove the lock when 
the usage count is 0.

!Screenshot 2023-06-05 at 4.31.14 PM.png|width=764,height=243!

This synchronization is done internally inside ConcurrentHashMap's two methods: 
_compute_ and {_}computeIfPresent{_}.

This synchronization creates a bottleneck when multiple threads try to obtain 
and release the same lock, even for read locks.
h2. Experiment

I did an experiment of pure OM key reads in the same buckets with 100 reader 
threads. The freon command looks like the following:
{code:java}
ozone freon ockrw -v duong -b obs --contiguous -n 50000000 -p zerobytes 
--percentage-read=100 -r 5000000 -s 0 --size 0 -m -t 100 {code}
With the current code, the total OPPS tops at ~100K and getKeyInfo latency is 
~800μs. The time taken to get lock for obtaining and to release the lock is 
~40μs. Not that for each getKeyInfo request, OM obtains and releases volume and 
bucket locks multiple times. 

With a [quick and naive 
change|https://github.com/duongkame/ozone/commit/91c2729cef1649f038b1560d260d41f91de4c0b2]
 to remove the synchronization when getting and releasing the locks, the 
getKeyInfo latency drops to ~400μs and total OPPS raised to ~160K. Time to get 
and release lock drops to 4-5μs. Please note that this is just to demonstrate 
the impact of the synchronization and not a practical change. 
h2. Proposed solution

 


> OM lock performance improvement
> -------------------------------
>
>                 Key: HDDS-8765
>                 URL: https://issues.apache.org/jira/browse/HDDS-8765
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Duong
>            Priority: Major
>         Attachments: Screenshot 2023-06-05 at 4.31.14 PM.png
>
>



