Is there a way to reduce the need for tasks to run on the same slave? I 
suspect the issue is having data left over from the last run - if that is 
the case, is there a shared storage solution that might reduce the time 
difference? If you can eliminate the need to bind tasks to specific nodes, 
you bypass the entire headache.

As for your other approaches:

A minor point: since a node can have multiple labels, your nodes can each 
have an individual label AND a shared label - meaning the fallback can be 
shared among all of the existing nodes, not just a couple of dedicated 
standby machines.
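
For instance (rough sketch, reusing the label names from your mail - adjust 
to taste), an agent's label field could read "my-node-pool-3 
my-node-pool-fallback", and the pipeline keeps the same expression it uses 
today:

    // Same expression as before, but now every pool member also carries the
    // fallback label, so a downed node's tasks can land anywhere in the pool
    // instead of only on the dedicated standby machines.
    node('my-node-pool-3 || my-node-pool-fallback') {
        // ... the task that prefers node 3 ...
    }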

But more to the point, if your main issue is that you are worried a node 
may be unavailable, you could consider automatic node allocation. I am not 
sure what else is out there, but the AWS EC2 plugin, for example, can 
automatically provision a new node when no executors are available for a 
label. That may be a decent backup strategy. If you are not using AWS, look 
for another node provisioning (cloud) plugin that fits, or failing that, 
look at how those plugins do it and write your own.

But maybe I am overthinking it. In the end, if your primary concern is that 
a node may be down - remember that a pipeline is Groovy code, Groovy code 
that has access to the Jenkins API/internals. You can write code that 
checks the state of the slaves and selects a label to use before you even 
reach the node() step. Sure, that will not fix a node going down in the 
middle of a job, but it may catch the job before it assigns a task to a 
dead node.
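
Something along these lines (untested sketch; the label names are 
placeholders, and if your pipeline runs in the Groovy sandbox, direct 
Jenkins API calls like these need script approval or a shared library):

    import jenkins.model.Jenkins

    // Return the preferred label if at least one agent carrying it is
    // online, otherwise fall back to the shared pool label.
    def pickLabel(String preferred, String fallback) {
        def anyOnline = Jenkins.instance.getLabel(preferred)?.nodes?.any {
            it.toComputer()?.isOnline()
        }
        return anyOnline ? preferred : fallback
    }

    def label = pickLabel('my-node-pool-3', 'my-node-pool-fallback')
    node(label) {
        // ... the actual task ...
    }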

Alternatively, in lieu of a plugin, you can simply write another job that 
scans your tasks and nodes and, if it detects a node that is down with a 
task waiting for it, reassigns that node's label to another node from the 
"standby" pool.
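
A rough sketch of what that repair job could do (system Groovy / script 
console style; untested, the pool size and label names are assumptions, and 
I am not certain a save() is enough to refresh the label cache - try it on 
a test master first):

    import jenkins.model.Jenkins
    import hudson.model.Slave

    def jenkins = Jenkins.instance
    def standby = jenkins.getLabel('my-node-pool-fallback').nodes.toList()

    // Assuming eight per-node labels, my-node-pool-1 .. my-node-pool-8.
    (1..8).collect { "my-node-pool-$it" }.each { label ->
        def nodes = jenkins.getLabel(label).nodes
        def allDown = nodes.every { it.toComputer()?.isOffline() }
        if (allDown && standby) {
            def spare = standby.remove(0)
            if (spare instanceof Slave) {
                // Hand the orphaned label to a standby node so tasks queued
                // for it can still be scheduled somewhere.
                spare.setLabelString("${spare.labelString} ${label}".trim())
                jenkins.save()
            }
        }
    }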

I realize all of this sounds hacky. I would really make it the first and 
foremost task to figure out whether you can bypass the problem in the first 
place.

-M



On Friday, October 28, 2016 at 8:06:15 PM UTC-7, John Calsbeek wrote:
>
> We have a problem trying to get more control over how the node() decides 
> what node to allocate an executor on. Specifically, we have a situation 
> where we have a pool of nodes with a specific label, all of which are 
> capable of executing a given task, but with a strong preference to run the 
> task on the same node that ran this task before. (Note that these tasks are 
> simply different pieces of code within a single pipeline, running in 
> parallel.) This is what Jenkins does normally, at job granularity, but as 
> JENKINS-36547 <https://issues.jenkins-ci.org/browse/JENKINS-36547> says, 
> all tasks scheduled from any given pipeline will be given the same hash, 
> which means that the load balancer has no idea which tasks should be 
> assigned to which node. In our situation, only a single pipeline ever 
> assigns jobs to this pool of nodes.
>
> So far we have worked around the issue by assigning a different label to 
> each and every node in the pool in question, but this has a new issue: if 
> any node in that pool goes down for any reason, the task will not be 
> reassigned to any other node, and the whole pipeline will hang or time out.
>
> We have worked around *that* by assigning each task to "my-node-pool-# || 
> my-node-pool-fallback", where my-node-pool-fallback is a label which 
> contains a few standby nodes, so that if one of the primary nodes goes down 
> the pipeline as a whole can still complete. It will be slower (these tasks 
> can take two to ten times longer when not running on the same node they ran 
> last time), but it will at least complete.
>
> Unfortunately, the label expression doesn't actually mean "first try to 
> schedule on the first node in the OR, then use the second one if the first 
> one is not available." Instead, there will usually be some tasks that 
> schedule on a fallback node even if the node they are "assigned" to is 
> still available. As a result, almost every run of this pipeline ends up 
> taking the worst-case time: it is likely that *some* task will wander 
> away from its assigned node to run on a fallback, which leads the fallback 
> nodes to be over-scheduled and leaves other nodes sitting idle.
>
> The question is: what are our options? One hack we've considered is 
> attempting to game the scheduler by using sleep()s: initially schedule all 
> the fallback nodes with a task that does nothing but sleep(), then schedule 
> all our real tasks (which will now go to their assigned machines whenever 
> possible, because the fallback nodes are busy sleeping), and finally let 
> the sleeps complete so that any tasks which couldn't execute on their 
> assigned machines now execute on the fallbacks. A better solution would 
> probably be to create a LoadBalancer plugin that codifies this somehow: 
> preferentially scheduling tasks only on their assigned label, scheduling on 
> fallbacks only after 30 seconds or a minute.
>
> Is anyone out there dealing with similar issues, or know of a solution 
> that I have overlooked?
>
> Thanks,
> John Calsbeek
>
