On Wed, May 25, 2011 at 06:57:38AM -0700, eric smith wrote:
>
> (...)
>
> The error condition itself is that we have a process with a workitem
> in position 0 but the workitem does not exist (e.g. value of
> stored_workitem is []). No errors are generated for the process. I
> noticed in ruote-kit that there is some logic to handle the display of
> this condition ( processes.html.haml ).
>
> %td
>   - process.position.each do |pos|
>     - stored_wi = process.stored_workitems.find { |wi| wi.fei.sid == pos[0] }
>     - text = "#{pos[1]} #{pos[2]['task']}"
>     - if stored_wi
>       = alink(:workitems, stored_wi.fei.sid, :text => text)
>     - else
>       &= text
Hello Eric,
this is the generic case: not all participants are storage participants. Not
having a stored workitem is rather the norm, while having one is, well, a sign
you're using a variation of a storage participant
(see further down for more about this).
> The net effect is that the process fails silently.
Sorry, it's not failing [silently], it's waiting for a reply from the
participant.
(upon re-reading, I realize you store via super(workitem) after the mailing
step... why don't you put a timeout around the mailer ?)
> We are still
> looking for the root cause of the defect and why it is cascading thru
> ruote in this way.
Sorry again, it is not cascading.
> But the interesting thing is that the process is
> now hung. Obviously writing better code for the participant is the
> best solution but should ruote know that it is carrying around a dead
> process?
What about using a timeout ?
participant :ref => 'toto', :timeout => '2h'
participant :ref => 'toto', :timeout => '2h', :on_timeout => 'error'
sequence :timeout => '2h' do
  participant 'alfred'
  participant 'bob'
end
http://ruote.rubyforge.org/common_attributes.html#timeout
http://ruote.rubyforge.org/common_attributes.html#on_timeout
concurrence :count => 1, :remaining => :cancel do
  sequence do
    alfred
    bob
  end
  sequence do
    wait '3h'
    echo "time out..."
  end
end
> I see a need to be able to validate storage and its processes to
> ensure that there is not a process in a hung state or a likely hung
> state. Most of our participant should last less than a minute so it
> should be pretty easy to build a monitor to look for long/hung
> workitems. or processes that have no valid workitems. I think we can
> extend ruote-kit to provide a storage status page to give some
> rudimentary information about how ruote is feeling.
Ruote needs your help to answer the question "is that process hung ?". By itself
it cannot answer that question, all it knows is "I'm just waiting for the
participant to reply". Hence the "timeout".
Note that you can forego using timeout and periodically "poke" your hung
processes.
See
http://ruote.rubyforge.org/process_administration.html
especially
http://ruote.rubyforge.org/process_administration.html#re_applying_stalled
A variation:
engine.launch_single(Ruote.define 'unstucker' do
  cron '5 0 * * *' do # every night, five minutes after midnight
    participant 'process_unstucker' # or something like that
  end
end)
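What the 'process_unstucker' participant does is up to you. Here is a minimal sketch of the selection step only, in plain Ruby without ruote. The Pending struct, its dispatched_at attribute and the stalled_wfids helper are hypothetical names of mine, not part of ruote; where the timestamp comes from depends on your setup.

```ruby
# Hypothetical record (not a ruote class): a process id and the time its
# current workitem was handed to the participant.
Pending = Struct.new(:wfid, :dispatched_at)

# Returns the wfids whose workitem has been waiting longer than max_age seconds.
def stalled_wfids(pending, now, max_age)
  pending.select { |pe| now - pe.dispatched_at > max_age }.map(&:wfid)
end

now = Time.utc(2011, 5, 25, 14, 0, 0)
pending = [
  Pending.new('20110525-aaa', now - 30),   # handed over 30 seconds ago
  Pending.new('20110525-bbb', now - 7200)  # handed over two hours ago
]

# With a one minute limit, only the second process is reported.
stalled = stalled_wfids(pending, now, 60)
```

The wfids it returns are the candidates for the "poke" described in the process administration page linked above.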
> So my questions:
>
> 1) What is the best ( or most definitive) way to determine if a
> participant is currently consuming or canceling a participant. It does
> not seem like the participant state is recorded in the process.
The [participant] expression itself has a state. It's an attribute: during
normal operation its value is nil. When an expression is getting cancelled, its
value is "cancelling". There are also "failed" and "timing_out".
engine.process(wfid).expressions.each do |exp|
  p [ exp.fei.to_s, exp.state ]
end
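For a monitor, you probably only care about the expressions whose state is not nil. A small sketch of that filter follows. The Expression struct is a stand-in I made up so the snippet runs without ruote; with a real engine you would pass in engine.process(wfid).expressions instead.

```ruby
# Stand-in for a ruote expression, limited to the two attributes used above.
Expression = Struct.new(:fei, :state)

# Keeps the expressions whose state is not nil, i.e. the ones that are
# currently "cancelling", "timing_out" or "failed".
def unusual_expressions(expressions)
  expressions.reject { |exp| exp.state.nil? }
end

exps = [
  Expression.new('0_0', nil),
  Expression.new('0_0_1', 'cancelling'),
  Expression.new('0_0_2', 'failed')
]

unusual_expressions(exps).each { |exp| p [exp.fei.to_s, exp.state] }
```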
> 2) We are mirroring workitems to an ActiveRecord model, which causes us
> to go through a few extra gyrations. We do this by building a base
> storage participant that syncs the workitems and having all other
> participants inherit from it.
>
> Our base participant: https://gist.github.com/990909
>
> Is there a better way to accomplish the same thing that would
> be more fault tolerant?
Mirroring indeed. What storage are you using ? You could stop using
Ruote::StorageParticipant and only store via ActiveRecord. Or you could let
ruote-sequel or ruote-dm do the storage, while doing further manipulation via
ActiveRecord (it can read any table when handled in a persuasive way).
I see nothing wrong with your current scheme apart from what you admitted: you
trusted the mailer a bit too much.
It's weird that your participant goes
  a) store in AR
  b) send mail
  c) store in storage (super(workitem))
Going a -> c -> b would solve the pseudo-issue mentioned at the top of this
reply.
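A rough sketch of the a -> c -> b order, runnable without ruote. FakeStorageParticipant stands in for Ruote::StorageParticipant, the @mirrored array stands in for your ActiveRecord model, and the mailer is any callable; all of these names are assumptions for illustration, not your gist and not ruote's API.

```ruby
# Stand-in for Ruote::StorageParticipant: it only remembers workitems.
class FakeStorageParticipant
  attr_reader :stored

  def initialize
    @stored = []
  end

  def consume(workitem)
    @stored << workitem
  end
end

# Hypothetical base participant following the a -> c -> b order.
class MirroringParticipant < FakeStorageParticipant
  attr_reader :mirrored, :mail_errors

  def initialize(mailer)
    super()
    @mailer = mailer   # anything responding to #call(workitem)
    @mirrored = []     # stands in for the ActiveRecord model
    @mail_errors = []
  end

  def consume(workitem)
    @mirrored << workitem      # a) store in AR
    super(workitem)            # c) store in the storage
    begin
      @mailer.call(workitem)   # b) send mail last
    rescue StandardError => e
      @mail_errors << e.message # a mail failure no longer prevents a) and c)
    end
  end
end

broken_mailer = ->(_workitem) { raise 'smtp server unreachable' }
participant = MirroringParticipant.new(broken_mailer)
participant.consume({ 'task' => 'review' })
```

Even with the broken mailer, the workitem ends up both mirrored and stored, so the process shows a stored workitem instead of an empty list.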
> 3.) Are there other obvious conditions we should look for in monitor/
> storage validator?
Errors and stuck processes: I think the obvious cases are covered.
> 4.) Is it possible to have a participant have an affinity for a
> particular worker ( or the other way around)?
Yes, but your process could get stuck if all the preferred workers are down.
You add an #accept?(workitem) method to your participant, where you reply true
or false. Reply false if the current worker is not suitable.
http://ruote.rubyforge.org/implementing_participants.html#accept
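A sketch of what such an #accept? could look like, assuming affinity means "only run on certain hosts". The PickyParticipant class and its constructor arguments are hypothetical (a real ruote participant receives its options differently); only the accept?(workitem) method name comes from the documentation linked above.

```ruby
require 'socket'

# Hypothetical participant that only accepts work on a list of hosts.
class PickyParticipant
  # current_host is injectable so the check can be tried out anywhere.
  def initialize(preferred_hosts, current_host = Socket.gethostname)
    @preferred_hosts = preferred_hosts
    @current_host = current_host
  end

  # Replies true when the worker running this code sits on a preferred host,
  # false otherwise.
  def accept?(_workitem)
    @preferred_hosts.include?(@current_host)
  end
end

on_preferred = PickyParticipant.new(%w[worker1 worker2], 'worker1')
elsewhere = PickyParticipant.new(%w[worker1 worker2], 'worker9')
```

Keep the warning above in mind: if every preferred host is down, nobody replies true and the workitem waits.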
...
Upon final reading of your email and my reply, I think that wrapping your
mailer call with a timeout (and some reaction block) is a must. You know it can
go wrong and you know how to deal with it.
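Something along these lines, using Ruby's standard Timeout module. The with_mail_timeout helper and the symbols it returns are names I made up for the sketch; the block is where your own mailer call would go.

```ruby
require 'timeout'

# Hypothetical helper: runs the block (e.g. the mail delivery) and gives up
# after `seconds`. Returns [:ok, result], [:timed_out, nil] or [:failed, error].
def with_mail_timeout(seconds)
  [:ok, Timeout.timeout(seconds) { yield }]
rescue Timeout::Error
  [:timed_out, nil]
rescue StandardError => e
  [:failed, e]
end

# A slow "mailer" is cut off, a fast one goes through, a broken one is caught.
slow = with_mail_timeout(0.1) { sleep 1; :sent }
fast = with_mail_timeout(1) { :sent }
broken = with_mail_timeout(1) { raise 'connection refused' }
```

The "reaction block" is then a matter of branching on the first element: continue the flow on :ok, flag the workitem or raise on the other two.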
Best regards,
--
John Mettraux - http://jmettraux.wordpress.com