[
https://issues.apache.org/jira/browse/IGNITE-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Oleg Valuyskiy updated IGNITE-27871:
------------------------------------
Description:
h1. Impact
Users observe high latency under load when executing Compute tasks that are
present in the node classpath while {*}peerClassLoadingEnabled=true{*}. The
issue is amplified in multi-node clusters and with concurrent executions (a
thin client calling a task by name).
h1. Root causes (2 related parts)
*#1*
Local lookup misses the cache: local deployment metadata is created without
classLoader/classLoaderId, so *GridDeploymentLocalStore#deployment(meta)* can’t
match cached deployments, and execution repeatedly falls back to
*GridDeploymentLocalStore#deploy()* even when a deployment already exists.
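The lookup miss described in #1 can be illustrated with a small standalone model. This is only a sketch and not the Ignite code: the classes LookupMeta, CachedDeployment and LookupSketch are hypothetical, and the assumption that a cache hit requires the loader id in the metadata to equal the cached one is a simplification made for illustration.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for deployment lookup metadata. */
final class LookupMeta {
    final String alias;
    final String loaderId; // null models metadata created without classLoaderId

    LookupMeta(String alias, String loaderId) {
        this.alias = alias;
        this.loaderId = loaderId;
    }
}

/** Hypothetical stand-in for a cached local deployment. */
final class CachedDeployment {
    final String alias;
    final String loaderId;

    CachedDeployment(String alias, String loaderId) {
        this.alias = alias;
        this.loaderId = loaderId;
    }
}

/** Simplified model of a fast lookup path with a synchronized fallback. */
final class LookupSketch {
    private final Map<String, CachedDeployment> cache = new ConcurrentHashMap<>();
    private int slowPathCalls;

    void register(CachedDeployment dep) {
        cache.put(dep.alias, dep);
    }

    /** Fast path (assumed rule): a hit requires matching loader ids. */
    CachedDeployment deployment(LookupMeta meta) {
        CachedDeployment dep = cache.get(meta.alias);
        return dep != null && Objects.equals(dep.loaderId, meta.loaderId) ? dep : null;
    }

    /** Slow path: serialized by the monitor, counted to make fallbacks visible. */
    synchronized CachedDeployment deploy(LookupMeta meta) {
        slowPathCalls++;
        return cache.get(meta.alias);
    }

    /** Tries the fast path first and falls back to deploy() on a miss. */
    CachedDeployment resolve(LookupMeta meta) {
        CachedDeployment dep = deployment(meta);
        return dep != null ? dep : deploy(meta);
    }

    synchronized int slowPathCalls() {
        return slowPathCalls;
    }
}
```

In this model, every resolve() call with a null loaderId takes the synchronized fallback even though the deployment is cached, while the same call with the loader id filled in never leaves the fast path.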
*#2*
Contention in *deploy()* under load: *GridDeploymentLocalStore#deploy()* is
synchronized and, in the common reuse scenario, performs an O(N) scan over
*cache.values()* to locate an existing deployment by ClassLoader. Under high
concurrency this leads to lock contention.
h1. Behaviour
* *Expected:* Once a task is available locally, subsequent executions should
reuse the cached deployment with minimal synchronization overhead.
* *Actual:* Repeated fallback to synchronized *deploy()* and expensive
scanning cause contention and high latency.
h1. Proposed fix
*#1*
Ensure local deployment lookup metadata includes enough information (class
loader and/or loader id) to allow cache hits for locally available tasks.
*#2*
Optimize the *GridDeploymentLocalStore#deploy()* reuse path by using a
ClassLoader -> deployment/deps index instead of scanning *cache.values()*
under mux.
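The idea behind fix #2 can be sketched as follows. This is a minimal standalone illustration, not the actual patch: Dep, IndexedStoreSketch, findByScan and findByIndex are hypothetical names, and the alias-based cache layout is an assumption. A real implementation would also have to remove index entries on undeploy so that class loaders are not retained.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical deployment record bound to one class loader. */
final class Dep {
    final ClassLoader ldr;
    final Set<String> aliases = ConcurrentHashMap.newKeySet();

    Dep(ClassLoader ldr) {
        this.ldr = ldr;
    }
}

/** Simplified local store with a ClassLoader -> deployment index. */
final class IndexedStoreSketch {
    private final Object mux = new Object();
    private final Map<String, Deque<Dep>> cache = new ConcurrentHashMap<>();
    private final Map<ClassLoader, Dep> byLdr = new ConcurrentHashMap<>();

    /** Baseline reuse path: O(N) scan over cache.values() under the lock. */
    Dep findByScan(ClassLoader ldr) {
        synchronized (mux) {
            for (Deque<Dep> deps : cache.values())
                for (Dep d : deps)
                    if (d.ldr == ldr)
                        return d;
            return null;
        }
    }

    /** Indexed reuse path: one hash lookup, no lock. */
    Dep findByIndex(ClassLoader ldr) {
        return byLdr.get(ldr);
    }

    Dep deploy(String alias, ClassLoader ldr) {
        // Common case: the deployment exists and already knows this alias.
        Dep d = byLdr.get(ldr);
        if (d != null && d.aliases.contains(alias))
            return d;

        // Rare case: create or extend the deployment under the lock.
        synchronized (mux) {
            d = byLdr.computeIfAbsent(ldr, Dep::new);
            if (d.aliases.add(alias))
                cache.computeIfAbsent(alias, k -> new ArrayDeque<>()).addFirst(d);
            return d;
        }
    }
}
```

With the index in place, repeated deploy() calls for an already registered task return without taking mux, so the lock is held only when a deployment is created or gains a new alias.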
was:
h1. Impact
Users observe high latency under load when executing Compute tasks that are
present in node classpath while {*}peerClassLoadingEnabled=true{*}. The issue
is amplified in multi-node clusters and with concurrent executions (thin client
calling task by name).
h1. Root causes (2 related parts)
*#1*
Local lookup misses cache: local deployment metadata is created without
classLoader/classLoaderId, so *GridDeploymentLocalStore#deployment(meta)* can’t
match cached deployments and execution repeatedly falls back to
*GridDeploymentLocalStore#deploy()* even when deployment already exists.
*#2*
Contention in *deploy()* under load: *GridDeploymentLocalStore#deploy()* is
synchronized and, in the common reuse scenario, performs an O(N) scan over
*cache.values()* to locate an existing deployment by ClassLoader. Under high
concurrency this leads to lock contention.
h1. Expected
Once a task is available locally, subsequent executions should reuse cached
deployment with minimal synchronization overhead.
h1. Actual
Repeated fallback to synchronized *deploy()* and expensive scanning causes
contention and high latency.
h1. Proposed fix
*#1*
Ensure local deployment lookup metadata includes enough information (class
loader and/or loader id) to allow cache hits for locally available tasks.
*#2*
Optimize *GridDeploymentLocalStore#deploy()* reuse path by using a ClassLoader
-> deployment/deps index instead of scanning *cache.values()* under mux.
> High latency for locally deployed tasks when peerClassLoadingEnabled=true
> (deployment lookup contention)
> --------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-27871
> URL: https://issues.apache.org/jira/browse/IGNITE-27871
> Project: Ignite
> Issue Type: Bug
> Reporter: Oleg Valuyskiy
> Assignee: Oleg Valuyskiy
> Priority: Major
> Labels: ise