Re: [ruote:3102] Ruote Participant Error trapping/Monitoring

John Mettraux Wed, 25 May 2011 07:29:09 -0700

On Wed, May 25, 2011 at 06:57:38AM -0700, eric smith wrote:
> 
> (...)
> 
> The error condition itself is that we have a process with a workitem
> in position 0 but the workitem does not exist (e.g. value of
> stored_workitem is []). No errors are generated for the process. I
> noticed in ruote-kit that there is some logic to handle the display of
> this condition ( processes.html.haml ).
> 
>       %td
>           - process.position.each do |pos|
>             - stored_wi = process.stored_workitems.find { |wi| wi.fei.sid == pos[0] }
>             - text = "#{pos[1]} #{pos[2]['task']}"
>             - if stored_wi
>               = alink(:workitems, stored_wi.fei.sid, :text => text)
>             - else
>               &= text


Hello Eric,

This is the generic case. Not all participants are storage participants: not 
having a stored workitem is rather the norm, while having one is, well, a sign 
you're using a variation of a storage participant.

(see further down for more about this).


> The net effect is that the process fails silently.

Sorry, it's not failing [silently], it's waiting for a reply from the 
participant.

(Upon re-reading, I realize you store via super(workitem) after the mailing 
step... Why don't you put a timeout around the mailer?)


> We are still
> looking for the root cause of the defect and why it is cascading thru
> ruote in this way.

Sorry again, it is not cascading.


> But the interesting thing is that the process is
> now hung, Obviously writing better code for the participant is the
> best solution but should ruote know that it is carrying around a dead
> process?

What about using a timeout ?

  participant :ref => 'toto', :timeout => '2h'
  participant :ref => 'toto', :timeout => '2h', :on_timeout => 'error'

  sequence :timeout => '2h' do
    participant 'alfred'
    participant 'bob'
  end

  http://ruote.rubyforge.org/common_attributes.html#timeout
  http://ruote.rubyforge.org/common_attributes.html#on_timeout

  concurrence :count => 1, :remaining => :cancel do
    sequence do
      alfred
      bob
    end
    sequence do
      wait '3h'
      echo "time out..."
    end
  end


> I see a need to be able to validate storage and its processes to
> ensure that there is not a process in a hung state or a likely hung
> state. Most of our participant should last less than a minute so it
> should be pretty easy to build a monitor to look for long/hung
> workitems. or processes that have no valid workitems. I think we can
> extend ruote-kit to provide a storage status page to give some
> rudimentary information about how ruote is feeling.

Ruote needs your help to answer the question "is that process hung?". By itself 
it cannot answer that question; all it can say is "I'm just waiting for the 
participant to reply".

Hence the "timeout".

Note that you can forego using timeout and periodically "poke" your hung 
processes.

See

  http://ruote.rubyforge.org/process_administration.html

especially

  http://ruote.rubyforge.org/process_administration.html#re_applying_stalled


A variation

  engine.launch_single(Ruote.define 'unstucker' do
    cron '5 0 * * *' do # every night, five minutes after midnight
      participant 'process_unstucker' # or something like that
    end
  end)


> So my questions:
>
> 1)    What is the best  ( or most definitive) way to determine if a
> participant is currently consuming or canceling a participant. It does
> not seem  like the participant state is recorded in the process.

The [participant] expression itself has a state. It's an attribute, during 
normal operation its value is nil. When an expression is getting cancelled, its 
value is "cancelling". There is also "failed" and "timing_out".

  engine.process(wfid).expressions.each do |exp|
    p [ exp.fei.to_s, exp.state ]
  end
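
For a monitor, the same loop can be turned into a filter that keeps only the 
expressions whose state is not nil. What follows is a minimal sketch, not ruote 
code: the Expression struct is a stand-in that only models the fei and state 
readers used above, and expressions_with_state is a hypothetical helper name.

```ruby
# Stand-in for a ruote expression; only #fei and #state are modelled.
Expression = Struct.new(:fei, :state)

# Keeps the expressions that are not in normal operation (state is not nil).
def expressions_with_state(expressions)
  expressions.reject { |exp| exp.state.nil? }
end

expressions = [
  Expression.new('0_0', nil),          # normal operation
  Expression.new('0_1', 'cancelling'),
  Expression.new('0_2', 'failed')
]

expressions_with_state(expressions).each do |exp|
  puts "#{exp.fei} is #{exp.state}"
end
```

Against a real engine, the argument would presumably be 
engine.process(wfid).expressions instead of the hand-built array.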


> 2)    We are mirroring workitems to a active record model which cause us
> to go thru a  few extra gyrations, We do this by building a base
> storage participant  that sync the workitems and having all other
> participants inherit from it.
> 
>       Our base participant: https://gist.github.com/990909
> 
>        Is there a better way to accomplish the same thing that would
> be more fault tolerant?

Mirroring indeed. What storage are you using? You could use 
Ruote::StorageParticipant and only store via ActiveRecord. Or you could let 
ruote-sequel or ruote-dm do the storage, while doing further manipulation via 
ActiveRecord (it can read any table when handled in a persuasive way).

I see nothing wrong with your actual scheme, apart from what you admitted: you 
trusted the mailer a bit too much.

It's weird that your participant goes

  a) store in AR
  b) send mail
  c) store in storage (super(workitem))

Going a -> c -> b would solve the pseudo-issue mentioned at the top of this 
reply.
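
The a -> c -> b ordering can be sketched in plain Ruby. This is only an 
illustration under assumptions: FakeStorageParticipant stands in for the real 
storage participant, MailingParticipant is a hypothetical subclass, and each 
step merely records its name instead of touching ActiveRecord, the storage or 
a mailer.

```ruby
# Stand-in for the storage participant: #consume only records that it ran.
class FakeStorageParticipant
  attr_reader :steps

  def initialize
    @steps = []
  end

  def consume(workitem)
    @steps << :stored_in_storage
  end
end

# Hypothetical participant following the a -> c -> b order.
class MailingParticipant < FakeStorageParticipant
  def consume(workitem)
    @steps << :stored_in_ar   # a) mirror the workitem in ActiveRecord
    super(workitem)           # c) store in storage
    @steps << :mail_sent      # b) send the mail last
  end
end

participant = MailingParticipant.new
participant.consume('task' => 'notify customer')
p participant.steps
```

With this order, the workitem is already stored by the time the mailing step 
runs, whatever happens to the mail afterwards.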


> 3.) Are there other obvious conditions we should look for in monitor/
> storage validator?

Errors and stuck processes. I think the obvious cases are covered.


> 4.) Is it possible to have a participant have an affinity for a
> particular worker ( or the other way around)?

Yes, but your process could get stuck if all the preferred workers are down.

You add an #accept?(workitem) method to your participant, where you reply true 
or false. Reply false if the current worker is not suitable.

  http://ruote.rubyforge.org/implementing_participants.html#accept
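
A minimal sketch of such a method, showing only #accept? and leaving out the 
rest of the participant. The class name and the RUOTE_WORKER_ROLE environment 
variable are assumptions made up for the example; any way of telling workers 
apart would do.

```ruby
# Hypothetical participant that only wants to run on "mailer" workers.
class MailerParticipant
  # Replies true if the current worker is suitable, false otherwise.
  # RUOTE_WORKER_ROLE is an assumed variable, set when starting the worker.
  def accept?(workitem)
    ENV['RUOTE_WORKER_ROLE'] == 'mailer'
  end
end

ENV['RUOTE_WORKER_ROLE'] = 'mailer'
puts MailerParticipant.new.accept?({})
```

As said above, if no worker is started with the expected role, workitems for 
this participant will wait until one is.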

...

Upon final reading of your email and my reply, I think that wrapping your 
mailer call with a timeout (and some reaction block) is a must. You know it can 
go wrong and you know how to deal with it.
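
A minimal sketch of such a wrapper, using Timeout from Ruby's standard library 
(this is separate from ruote's :timeout attribute). deliver_with_timeout and 
the status symbols it returns are hypothetical names; the block stands in for 
the actual mailer call.

```ruby
require 'timeout'

# Runs the block, giving up after `seconds`.
# Returns :sent on success, :timed_out if the block took too long and
# :failed if it raised any other StandardError.
def deliver_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
  :sent
rescue Timeout::Error
  :timed_out
rescue StandardError
  :failed
end

# sleep stands in for a mailer that hangs.
status = deliver_with_timeout(0.1) { sleep 1 }
puts "mailer status: #{status}"
```

The caller can then react to the status, for example by flagging the workitem 
or raising so that the error shows up in the process.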


Best regards,

-- 
John Mettraux - http://jmettraux.wordpress.com

