RE: [Alchemi-developers] Fault Tolerance in Alchemi

Tibor Biro Thu, 16 Mar 2006 06:15:20 -0800

"Executing thread “hangs” on the Executor. This means that the executing thread is not responding but the heartbeat thread keeps working. Currently the Executor remains in the hung state until it is restarted. This is not an acceptable solution. We have some ideas if you are interested in exploring this area."

[Tibor Biro] One idea is to monitor the executing thread and terminate it if it runs longer than a set time limit. The application should be able to set this limit, though an override at the Executor level is probably desirable as well. One problem is that some machines take longer than others to execute the same work, so it might be useful to scale the wait time by the computer’s speed.
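To make the timeout idea concrete, here is a minimal sketch in Python (not Alchemi code) of how the limit could be resolved. The names `effective_timeout`, `is_overdue` and `speed_factor` are hypothetical, and it assumes each Executor knows its speed relative to some reference machine (1.0 = reference, 0.5 = half as fast). How the thread is actually terminated is left out, since that depends on the runtime.

```python
from typing import Optional


def effective_timeout(app_timeout_s: float,
                      executor_override_s: Optional[float] = None,
                      speed_factor: float = 1.0) -> float:
    """Return the time limit, in seconds, for one executing thread.

    app_timeout_s       -- limit requested by the application
    executor_override_s -- optional Executor-level override; wins if set
    speed_factor        -- assumed relative machine speed (1.0 = reference)
    """
    base = executor_override_s if executor_override_s is not None else app_timeout_s
    if base <= 0 or speed_factor <= 0:
        raise ValueError("timeout and speed_factor must be positive")
    # A slower machine (factor < 1) gets proportionally more time.
    return base / speed_factor


def is_overdue(started_at_s: float, now_s: float, timeout_s: float) -> bool:
    """True if the thread has run longer than its limit and may be terminated."""
    return (now_s - started_at_s) > timeout_s
```

With these assumptions, a 60 second application limit on a machine half as fast as the reference becomes 120 seconds, while an Executor override of 30 seconds replaces the application value entirely.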

Another idea I’ve been toying with is to require long-running threads to raise status events containing a “percent done” value and perhaps other custom data. The monitoring thread would then have enough data to tell whether a thread is dead or still alive and just taking longer to complete. The events could carry enough information to be used as checkpoints, but this would be up to the implementation of each application.
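A rough sketch of the status-event idea, again in Python rather than Alchemi code. `ProgressEvent`, `ProgressMonitor` and `stall_after_s` are made-up names for illustration. The sketch assumes any event counts as a sign of life and that the checkpoint payload is opaque, application-defined data.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ProgressEvent:
    thread_id: int
    percent_done: float
    checkpoint: Optional[bytes] = None  # optional application-defined data


class ProgressMonitor:
    """Tracks the last status event per thread to tell 'slow' from 'dead'."""

    def __init__(self, stall_after_s: float) -> None:
        self.stall_after_s = stall_after_s
        # thread_id -> (time of last sign of life, last event or None)
        self._last: Dict[int, Tuple[float, Optional[ProgressEvent]]] = {}

    def start(self, thread_id: int, now_s: float) -> None:
        """Register a thread when it begins executing."""
        self._last[thread_id] = (now_s, None)

    def report(self, event: ProgressEvent, now_s: float) -> None:
        """Record a status event raised by the executing thread."""
        self._last[event.thread_id] = (now_s, event)

    def is_stalled(self, thread_id: int, now_s: float) -> bool:
        """True if nothing has been heard from the thread for too long."""
        last_seen, _ = self._last[thread_id]
        return (now_s - last_seen) > self.stall_after_s

    def last_checkpoint(self, thread_id: int) -> Optional[bytes]:
        """Latest checkpoint data, if the application supplied any."""
        _, event = self._last[thread_id]
        return event.checkpoint if event is not None else None
```

Timestamps are passed in rather than read from a clock so the monitor is easy to test. A stricter variant could require `percent_done` to increase before treating the thread as alive.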

Both approaches could be implemented in some mix. I wouldn’t mind exploring other ideas as well.

"In case of a Manager failure there is no backup at this point. An immediate problem here is that once a Manager goes offline all Executors that are running threads will probably fail as well. It would be nice to have the Executor store the thread’s results and wait until the Manager comes back online. Other ideas are welcome."

[Tibor Biro] The Executor should persist the executed thread if the Manager is not available and send it back once a connection is re-established. You should investigate the points of failure in this scenario and address them as they are discovered.
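As an illustration of the persist-and-resend idea, here is a small Python sketch (not Alchemi code). The spool directory layout, the JSON format, and the `send` callback that raises `ConnectionError` while the Manager is down are all assumptions made for the example.

```python
import json
import os
from pathlib import Path
from typing import Callable


def persist_result(spool_dir: Path, thread_id: str, result: dict) -> Path:
    """Write a finished thread's result to disk so it survives a Manager outage."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    final = spool_dir / f"{thread_id}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    # Rename only after the write completes, so a crash mid-write
    # never leaves a half-written .json file in the spool.
    os.replace(tmp, final)
    return final


def flush_spool(spool_dir: Path, send: Callable[[str, dict], None]) -> int:
    """Try to deliver every stored result; return how many were sent."""
    sent = 0
    for path in sorted(spool_dir.glob("*.json")):
        result = json.loads(path.read_text())
        try:
            send(path.stem, result)
        except ConnectionError:
            break  # Manager still unavailable; keep the remaining files
        path.unlink()  # delete only after a successful send
        sent += 1
    return sent
```

One failure point this sketch does not solve: if the Executor crashes after `send` succeeds but before the file is deleted, the result is sent twice, so the Manager would need to ignore duplicates by thread id.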

"The areas you mentioned above seem to be good areas of interest. I want to explore both of them in more detail and then try to fix the problems described. I need your help: which parts of the code do I need to understand before I start working on them? Understanding the whole codebase is not easy. I want to fix these problems as soon as possible, and I am devoting all my time to Alchemi."

[Tibor Biro] Once you decide which area to work on let’s start a thread on the SourceForge forums so we can iron out the details and document what other ideas were considered.

Good Luck!

Tibor
