Hi All,
A few comments about the Alchemi resilience features mentioned below:
Scheduler Resilience
- Use a heuristic (user-supplied?) within the GThread class to estimate when a
thread has taken too long. Re-launch with a greater running-time
allowance until completion (or until the max. number of launches is exceeded).
Yes, good idea. However, this will need to be a separate scheduler. The
user/admin on the Manager can be given a number of scheduling options,
of which one will be chosen.
We can of course have a default, which is what gets used as soon as the
Manager is installed, and the user has not yet customised any settings.
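To make the timeout heuristic concrete, here is a rough sketch of what such a policy could look like. This is only an illustration in Java (Alchemi itself is .NET), and all the names here (ThreadTimeoutPolicy, allowanceFor, etc.) are my own, not existing Alchemi API:

```java
// Hypothetical policy for the "re-launch with greater allowance" idea.
// All names are illustrative; this is not part of Alchemi.
class ThreadTimeoutPolicy {
    private final long baseAllowanceMs;   // allowance for the first launch
    private final double backoffFactor;   // growth factor per re-launch
    private final int maxLaunches;        // give up after this many attempts

    ThreadTimeoutPolicy(long baseAllowanceMs, double backoffFactor, int maxLaunches) {
        this.baseAllowanceMs = baseAllowanceMs;
        this.backoffFactor = backoffFactor;
        this.maxLaunches = maxLaunches;
    }

    /** Running-time allowance for the given (1-based) launch attempt. */
    long allowanceFor(int attempt) {
        return (long) (baseAllowanceMs * Math.pow(backoffFactor, attempt - 1));
    }

    /** Whether the thread may be re-launched after timing out. */
    boolean mayRelaunch(int attemptsSoFar) {
        return attemptsSoFar < maxLaunches;
    }
}
```

A scheduler that supports this would simply consult the policy when a thread exceeds its allowance, and either re-dispatch it with `allowanceFor(attempt + 1)` or mark it failed.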
- Multiple instances of the same thread on different executors would be
great, especially if the number of instances is an option in the Console.
Again, a separate scheduler would be better for this than pumping all
these new features into the existing one. Having worked with
distributed systems, and brokers with different schedulers,
I can say the simpler each module is, the better... And ideally these
schedulers should be more or less independent pieces that are not too
tightly coupled with the rest of the system. The current design allows
such an implementation: since there is a clean Scheduler interface,
custom schedulers can implement it, and there will be a nice
separation between the scheduler module and the rest of the system.
- Add Round-Robin scheduling soon to allow multiple CPU-intensive
applications to run without FIFO execution.
Similar to the above....
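For what it's worth, the core of a round-robin dispatch between applications is tiny, which supports the point about keeping each scheduler module simple. A minimal sketch (in Java, with application ids as plain strings; none of this is Alchemi code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal round-robin rotation over registered applications.
// Illustrative only; a real scheduler would hand out the next
// pending thread of the chosen application.
class RoundRobinScheduler {
    private final Deque<String> apps = new ArrayDeque<>();

    void register(String appId) {
        apps.addLast(appId);
    }

    /** Pick the next application and rotate it to the back of the queue. */
    String nextApp() {
        String app = apps.pollFirst();
        if (app != null) {
            apps.addLast(app);
        }
        return app;
    }
}
```

Each call to `nextApp()` cycles through the registered applications in turn, so no single CPU-intensive application can starve the others the way a pure FIFO queue does.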
Executor Resilience
- A thread to monitor the state of each Executor within the Manager; it could
use frequent pings or some other message system. One thread in the Manager
talks to a Responder thread in the Executor.
This already exists: please see the IExecutor interface. There is a
Ping() method, which is called periodically by a "WatchDog" thread in
the Manager to check whether the Executor is alive.
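For anyone new to the code, the shape of that watchdog loop is roughly the following. This is a simplified Java stand-in (the real IExecutor interface is in the Alchemi C# sources and has more members); a ping that throws is treated as a dead executor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the executor's remote ping; not the real IExecutor.
interface ExecutorStub {
    void ping() throws Exception; // throws if the executor is unreachable
}

// One sweep of a Manager-side watchdog: ping everyone, collect the dead.
class WatchDog {
    private final Map<String, ExecutorStub> executors = new ConcurrentHashMap<>();

    void register(String id, ExecutorStub ex) {
        executors.put(id, ex);
    }

    /** Ping every registered executor; deregister and report the ones that fail. */
    List<String> sweep() {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, ExecutorStub> e : executors.entrySet()) {
            try {
                e.getValue().ping();
            } catch (Exception ex) {
                dead.add(e.getKey());
                executors.remove(e.getKey()); // safe on ConcurrentHashMap
            }
        }
        return dead;
    }
}
```

In the Manager the ids returned by a sweep would feed straight into the re-scheduling logic, so threads on dead executors get requeued.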
Scheduler features that improve resilience:
- schedule a thread on multiple executors and take the response from the
first one. This improves the chances of a thread being executed.
- schedule a thread on multiple executors and compare the results before
returning to the application. This improves the quality of the computation
done by executors and weeds out executors that corrupt data.
- wait a given amount of time for a thread to be executed and, if no response
is received, re-schedule the thread on another executor. This is an
optimistic implementation of the first variation.
Again, the ideal scenario would be to implement separate schedulers which
provide redundant scheduling for robustness.
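The first variation ("take the response from the first one") maps almost directly onto standard concurrency primitives. A sketch in Java, purely to illustrate the idea (the real implementation would dispatch to remote executors, not a local thread pool):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// "Schedule on multiple executors, take the first response."
// invokeAny blocks until one replica succeeds and cancels the rest.
class RedundantDispatch {
    static <T> T firstResult(List<Callable<T>> replicas) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            return pool.invokeAny(replicas); // first successful replica wins
        } finally {
            pool.shutdownNow(); // abandon the slower replicas
        }
    }
}
```

The second variation (compare results to weed out corrupting executors) would instead wait for all replicas and apply a majority vote before returning to the application; the trade-off is latency versus trust in individual executors.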
Executor features that improve resilience:
- if the executor is shut down nicely, release the running thread back to
the manager.
- if the executor is killed, then release the running thread back to the
manager on startup.
- if the connection to the manager is lost due to network issues, the
manager being down or whatever, then re-connect and continue working on
the existing thread.
All the features mentioned here already exist :). Please have a look at
the GExecutor implementation.
Manager features that improve resilience:
- detect dead executors and re-schedule their threads.
- detect dead applications and stop running their threads.
As far as I know, this exists as well, in the Manager.
In my view, the main reason things seem to be broken is that the
GApplication class behaves differently depending on how it is
instantiated:
1. Single-use
2. Multi-use
The very meaning of multi-use is "it can be used multiple times, and
even indefinitely."
This means the App_finished event is meaningless in this situation. I
understand (and have implemented it based on my understanding) that
users want to keep adding threads to a re-usable / multi-use GApp, and
it keeps running those threads as and when they are added. This means
the app really never ends.
However, the "Status" flag of the GApp should then be properly exposed
to show the "current" state of the GApp:
- If a GApp has any threads that are running, it would be considered an
active/running GApp; otherwise the GApp is stopped / finished.
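That rule is simple enough to state in code. A sketch of the derivation (names like ThreadState and isRunning are mine; the real thread-state enumeration lives in the Alchemi sources):

```java
import java.util.List;

// Illustrative thread states; the real Alchemi enumeration may differ.
enum ThreadState { SCHEDULED, RUNNING, FINISHED, DEAD }

class GAppStatusSketch {
    /** A multi-use app is "running" iff any thread is still scheduled or running. */
    static boolean isRunning(List<ThreadState> threads) {
        for (ThreadState s : threads) {
            if (s == ThreadState.SCHEDULED || s == ThreadState.RUNNING) {
                return true;
            }
        }
        return false; // no pending work: the app is stopped / finished
    }
}
```

The point is that for a multi-use GApp the status is a *derived* property of its current threads, recomputed on demand, rather than a one-shot App_finished event.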
I have not checked the latest CVS fix, which Tibor mentioned in a
recent message, but this is what I see as the solution.
Is this what you have fixed, Tibor?
Cheers
Krishna.
_______________________________________________
Alchemi-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/alchemi-developers