[ 
https://issues.apache.org/jira/browse/IGNITE-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Valuyskiy updated IGNITE-27871:
------------------------------------
    Description: 
h1. Impact

Users observe high latency under load when executing Compute tasks that are 
present on the node classpath while {*}peerClassLoadingEnabled=true{*}. The 
issue is amplified in multi-node clusters and under concurrent execution (a 
thin client calling a task by name).
h1. Root causes (2 related parts)

*#1*

Local lookup misses the cache: local deployment metadata is created without 
classLoader/classLoaderId, so *GridDeploymentLocalStore#deployment(meta)* can’t 
match cached deployments, and execution repeatedly falls back to 
*GridDeploymentLocalStore#deploy()* even when a deployment already exists.

*#2*

Contention in *deploy()* under load: *GridDeploymentLocalStore#deploy()* is 
synchronized and, in the common reuse scenario, performs an O(N) scan over 
*cache.values()* to locate an existing deployment by ClassLoader. Under high 
concurrency this leads to lock contention.
h1. Expected

Once a task is available locally, subsequent executions should reuse the cached 
deployment with minimal synchronization overhead.
h1. Actual

Repeated fallback to the synchronized *deploy()* and its expensive scan cause 
contention and high latency.
h1. Proposed fix

*#1*

Ensure local deployment lookup metadata includes enough information (class 
loader and/or loader id) to allow cache hits for locally available tasks.
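
To illustrate the miss and the effect of populating the metadata, here is a 
minimal, self-contained sketch. It is a simplified model, not Ignite's code: 
*LocalLookupSketch* and *Entry* are hypothetical names, and the matching rule 
is assumed to be "same loader id or same class loader instance".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Simplified model of the lookup; hypothetical types, not Ignite's classes. */
final class LocalLookupSketch {
    /** Stand-in for both the lookup metadata and a cached deployment. */
    static final class Entry {
        final ClassLoader classLoader;
        final UUID classLoaderId;

        Entry(ClassLoader classLoader, UUID classLoaderId) {
            this.classLoader = classLoader;
            this.classLoaderId = classLoaderId;
        }
    }

    private final List<Entry> cache = new ArrayList<>();

    void add(Entry dep) {
        cache.add(dep);
    }

    /** Returns the cached entry matching the metadata, or null on a miss. */
    Entry deployment(Entry meta) {
        for (Entry dep : cache) {
            boolean sameId = meta.classLoaderId != null && meta.classLoaderId.equals(dep.classLoaderId);
            boolean sameLdr = meta.classLoader != null && meta.classLoader == dep.classLoader;

            if (sameId || sameLdr)
                return dep;
        }

        return null;
    }

    public static void main(String[] args) {
        ClassLoader ldr = ClassLoader.getSystemClassLoader();
        LocalLookupSketch store = new LocalLookupSketch();
        store.add(new Entry(ldr, UUID.randomUUID()));

        // Metadata without loader info: nothing to match on, so the lookup misses.
        System.out.println(store.deployment(new Entry(null, null)) != null); // false

        // Metadata carrying the class loader: the cached deployment is found.
        System.out.println(store.deployment(new Entry(ldr, null)) != null); // true
    }
}
```

In this model, supplying either the class loader or its id is enough for a hit, 
which is why the fix is phrased as "class loader and/or loader id".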

*#2*

Optimize the *GridDeploymentLocalStore#deploy()* reuse path by using a 
ClassLoader -> deployment/deps index instead of scanning *cache.values()* while 
holding the *mux* lock.
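
One possible shape for such an index, as a sketch only: the class and method 
names below (*LoaderIndexSketch*, *Deployment*, *undeploy*) are hypothetical 
and do not reflect the actual patch. The sketch assumes that class loaders are 
compared by reference and that the index is guarded by the same lock as the 
cache, so the reuse path becomes a single map lookup.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Sketch of an identity-keyed index; hypothetical types, not Ignite's classes. */
final class LoaderIndexSketch {
    /** Stand-in for a cached deployment. */
    static final class Deployment {
        final ClassLoader classLoader;

        Deployment(ClassLoader classLoader) {
            this.classLoader = classLoader;
        }
    }

    private final Object mux = new Object();

    /** Class loader -> deployment, compared by reference. Guarded by mux. */
    private final Map<ClassLoader, Deployment> byLoader = new IdentityHashMap<>();

    /** Reuses the deployment for the loader, creating it on first use. */
    Deployment deploy(ClassLoader ldr) {
        synchronized (mux) {
            Deployment dep = byLoader.get(ldr);

            if (dep == null) {
                dep = new Deployment(ldr);
                byLoader.put(ldr, dep);
            }

            return dep;
        }
    }

    /** Drops the index entry so that an undeployed loader is not reused. */
    void undeploy(ClassLoader ldr) {
        synchronized (mux) {
            byLoader.remove(ldr);
        }
    }

    public static void main(String[] args) {
        LoaderIndexSketch store = new LoaderIndexSketch();
        ClassLoader ldr = ClassLoader.getSystemClassLoader();

        // Both calls resolve to the same cached deployment.
        System.out.println(store.deploy(ldr) == store.deploy(ldr)); // true
    }
}
```

The lock is still taken, but the critical section no longer grows with the 
number of cached deployments. Any such index has to be kept in sync with the 
cache on undeploy, which is what the *undeploy* method stands for here.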

  was:
When a Compute task is executed from a thin client and the task class is 
available on the node classpath (e.g. placed in the libs directory),
*GridDeploymentManager#getLocalDeployment* creates *GridDeploymentMetadata* 
without *classLoader* and {*}classLoaderId{*}.

*GridDeploymentLocalStore#getDeployment(meta)* first attempts to find an 
existing deployment via {*}deployment(meta){*}. However, *deployment(meta)* 
matches cached deployments only if:
{code:java}
dep.classLoaderId() == meta.classLoaderId() ||
    dep.classLoader() == meta.classLoader()
{code}
Since both *meta.classLoader* and *meta.classLoaderId* are null, the cached 
local deployment can never be matched.

As a result, *GridDeploymentLocalStore#deploy(...)* is invoked on every task 
execution. This method is synchronized and performs additional lookup and 
bookkeeping logic, which introduces unnecessary contention and latency under 
high load.

The issue is reproducible with:
 * peerClassLoadingEnabled = true
 * task class present in node classpath (libs)
 * thin client executing the same task repeatedly by name

However, when {*}peerClassLoadingEnabled=false{*}, *GridDeploymentManager* 
initializes *locDep* and reuses it directly, bypassing 
{*}GridDeploymentLocalStore{*}, which avoids this problem.


> High latency for locally deployed tasks when peerClassLoadingEnabled=true 
> (deployment lookup contention)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-27871
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27871
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Oleg Valuyskiy
>            Assignee: Oleg Valuyskiy
>            Priority: Major
>              Labels: ise



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
