On Mon, Oct 04, 2010 at 06:47:18PM -0700, Eric Platon wrote:
> 
> I guess the latest solution will work in many cases, but it also looks
> like a mine that can be hard to detect in the future when debugging…
> 
> The problem pertains to several engines using the same storage---and
> perhaps to the wfid generation scheme (?).

Hello Eric,

thanks for taking the time to reflect on those issues; it's welcome. (I've seen 
that you've worked a lot with distributed systems.)

First, let me try to fix some vocabulary issues, introduced by the change from 
ruote pre-2.1.x to ruote 2.1.x.

A ruote engine is now no more than a dashboard on top of a ruote system. A 
ruote system is made of 1 storage and 1 or more workers.

The workers fetch msgs and schedules from the storage; msgs are executed 
immediately, schedules only when their time has come. Before executing a msg, a 
worker goes through a reservation step, so that no two workers execute the same 
msg. The bottleneck of the system is thus the storage, whose implementation has 
to be correct in order to avoid collisions.
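
Concretely, the common single-process setup looks like this (a minimal sketch, 
assuming ruote 2.1.x and its filesystem storage; swap in whatever storage fits 
your deployment):

  require 'rubygems'
  require 'ruote'
  require 'ruote/storage/fs_storage'

  # 1 storage, 1 worker, and the engine (dashboard) on top
  engine = Ruote::Engine.new(
    Ruote::Worker.new(
      Ruote::FsStorage.new('ruote_work')))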


> There are two situations then:
>
> 1) The engines work on the same wfid (meaning they need to share data)
> => It does not make sense (imho). The worker abstraction would be used
> on a single engine to that end. It may move the problem to workers…
>
> 2) The engines work on different wfids => They can use the same
> storage, but for different records. They may share process definitions
> in the storage, but they do not share process instances owing to their
> independence.

The wfid generator is implemented so that no two engines (dashboards) can draw 
the same wfid; this is backed by the storage.

The engines (dashboards) use the same storage. If two engines use different 
storages, they are part of two different ruote systems.
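
To make that concrete, here is a sketch of two OS processes sharing one storage 
and thus forming a single ruote system (the directory and participant names are 
just examples):

  # --- process A : a dedicated worker, no dashboard ---

  require 'ruote'
  require 'ruote/storage/fs_storage'

  worker = Ruote::Worker.new(Ruote::FsStorage.new('ruote_work'))
  worker.run
    # blocks, fetching and executing msgs and schedules

  # --- process B : a dashboard (engine) without an embedded worker ---

  require 'ruote'
  require 'ruote/storage/fs_storage'

  engine = Ruote::Engine.new(Ruote::FsStorage.new('ruote_work'))

  wfid = engine.launch(Ruote.process_definition do
    participant 'alpha'
  end)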


> That may lead to restructuring the storage or, just a guess, labeling
> process instances with an engine id so as to filter them.

Unfortunately, "engine id" should be renamed to "ruote system id" to be more 
accurate.

I haven't changed it, since it's seldom used. Most people go with one engine 
(one ruote system), but with some googling you'll find me suggesting 
multi-system deployments quite a few times.

At the other end of the spectrum, you'll find people running 1 ruote system per 
process instance (an interesting idea; after all, that's how we use classical 
interpreters).
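
For the record, the id is set via the storage options. A minimal sketch, 
assuming the 'engine_id' option (please double-check the option name against 
your ruote version):

  require 'ruote'
  require 'ruote/storage/fs_storage'

  # each ruote system gets its own id via the storage options
  storage = Ruote::FsStorage.new('ruote_work', 'engine_id' => 'system_a')

  engine = Ruote::Engine.new(Ruote::Worker.new(storage))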


> You referred to the case where the process dies. I am not sure I get
> it properly, but a dead process still leaves instance information in
> the storage. Assuming there is an engine id in that information, can't
> we do something that way? One problem considering the current commit
> (@0880deb) is that the above requires extending the cloche#get
> function to include filters. Is that a way to engage? I do not know
> well document-based data layers...

I'm not sure your question still applies after the clarification above.

If a 'single' process is stuck in an error, or stuck because a participant is 
not responding, relaunching it will probably just yield a new stuck process. 
That's why I consider that if engine.process(wfid) returns something (a process 
exists), I should not re-attempt the launch.

The administrator of the system has to detect the issue (engine.errors) and 
find a solution for it (IMHO).
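
In code, that guard comes down to something like this sketch (pdef stands for 
your process definition, wfid for the id remembered from the previous launch):

  ps = engine.process(wfid)

  if ps.nil?
    # no known instance, (re)launch and remember the new wfid
    wfid = engine.launch(pdef)
  else
    # an instance exists, don't relaunch; look at its errors instead
    ps.errors.each do |err|
      puts err.message
      #engine.replay_at_error(err)
        # once the cause has been fixed
    end
  end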


I hope it's not too confusing, cheers,

-- 
John Mettraux - http://jmettraux.wordpress.com
