Hi Jens,
I am sorry that nobody replied to you for two weeks. The Mozart
development team is quite small now. Moreover, we are no longer putting
effort into maintaining the current distribution layer of Mozart. In
fact, Boris Mejias and I (UCL) are currently reimplementing it.
We are rebuilding the distribution layer of Mozart on top of a new
language-independent distribution subsystem (DSS). The DSS was developed
at SICS, and is essentially a factored-out version of the current Mozart
distribution layer. We are also designing a new Fault module, with a much
simpler interface. And hopefully fewer bugs, too ;-)
Now, coming back to your question:
Jens Grabarske wrote:
> The general idea is that you have little computation servers running around
> who get their tasks by a central instance (the so-called Master). The master
> gets the tickets given to him by the clients, uses Connection.take to connect
> to them, gives them something to do (actually the name of a file with stuff
> to do) and then goes back to sleep again.
> Going back to sleep means: essentially the master uses Time.repeat to wake up
> at certain intervals to see whether he is needed (I built a poor man's cron).
Side question: Is there a reason for the master to connect to its
clients, instead of the clients connecting to the master? The latter
looks simpler to me. You don't need any sleep/wakeup mechanism.
Just to give you the idea: let the master create a port and publish a
ticket for it. Each client connects to the master's port via that ticket,
and sends a message to get something to do. The master simply reads the
messages on its port, and sends back tasks to clients. The message may
contain an unbound variable for the master to bind, or a port to reply
on. When a client completes its task,
it sends a message to the master to get a new one.
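
To make this concrete, here is a minimal, untested sketch of that scheme.
It assumes that Connection.offerUnlimited is available in your Mozart
version (a ticket that can be taken more than once). The message labels
getTask/task/done, the file names, and the DoTask procedure are made up
for illustration.

```oz
declare
%% Master: answer each getTask(Reply) with the next pending task,
%% or with 'done' once the (hypothetical) task list is empty.
proc {Serve Requests Pending}
   case Requests
   of getTask(Reply)|Rest then
      case Pending
      of T|Ts then Reply = task(T) {Serve Rest Ts}
      [] nil then Reply = done {Serve Rest nil}
      end
   [] _|Rest then {Serve Rest Pending}   % skip unknown messages
   end
end

%% Hypothetical placeholder for the real work done on a task file.
proc {DoTask File}
   {System.showInfo "working on "#File}
end

%% Client: take the master's ticket, then keep asking for tasks.
proc {Worker Ticket}
   Master = {Connection.take Ticket}
   proc {Loop}
      Reply in
      {Send Master getTask(Reply)}
      case Reply                        % blocks until the master answers
      of task(File) then {DoTask File} {Loop}
      [] done then skip
      end
   end
in
   {Loop}
end
in
%% On the master site only: create the port, print its ticket, serve.
local Requests MasterPort in
   MasterPort = {NewPort Requests}
   {System.showInfo {Connection.offerUnlimited MasterPort}}
   {Serve Requests ['job1.oz' 'job2.oz' 'job3.oz']}
end
```

Each client site would then call {Worker Ticket} with the ticket printed
by the master. The master never has to poll: it simply blocks on its
stream until a request arrives.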
> Now this all works like a charm - until I kill off one of the clients. Instead
> of ignoring this, the Master freezes - somehow the procedure the Time.repeat
> is supposed to trigger doesn't work anymore. Meaning: nothing happens, the
> Master stops working.
Can you stop the Master with Control-C? If you can't, it means Mozart
has really crashed.
> I tried something like:
> _ = {Fault.defaultDisable}
> but this doesn't seem to solve the problem.
This will never solve the problem. It makes a thread block when a
distributed operation cannot be performed. Try the following instead; at
least you should see an exception if things are broken:
{Fault.defaultEnable [tempFail permFail] _}
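
With exceptions enabled, you probably also want to make sure that a
failure does not silently kill the thread that does the periodic work.
A minimal sketch, assuming that Tick is the procedure you run at each
wake-up (SafeTick and Tick are hypothetical names, and the exact shape
of the exception is not something I would rely on, so it is just
printed):

```oz
declare
%% Run one wake-up action; report a failure instead of dying on it.
proc {SafeTick Tick}
   try
      {Tick}
   catch E then
      {System.show tickFailed(E)}
   end
end
```

You would then pass a procedure that calls {SafeTick Tick} to your timer
instead of Tick itself.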
> So, the big question is:
> 1. Obviously he doesn't like it that he took the ticket of a process that got
> busted. He should be tolerant to this, actually, he shouldn't take notice at
> all - how can one accomplish that? (for future use it will be nice to see
> whether there are problems with the clients, but for now, he can just ignore
> the state of them).
Mmmm, Connection.take should raise an exception in such a case.
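
If it does, the master can tolerate a dead client by catching that
exception and moving on. A minimal, untested sketch (TryTake is a
hypothetical helper; it assumes that the failure really shows up as an
exception, and does not cover the case where Connection.take just
blocks):

```oz
declare
%% Returns some(Entity) on success, or none if the ticket cannot be taken.
fun {TryTake Ticket}
   try
      some({Connection.take Ticket})
   catch E then
      {System.show takeFailed(Ticket E)}
      none
   end
end
```

The master would then hand out a task only when the result matches
some(Client), and skip the ticket otherwise.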
> 2. The connection established with Connection.take can't be accessed anymore
> after he gave the client the task list. Why didn't the garbage collector get
> rid of it?
This means that both sites are still sharing some language entity. It
can be anything: a variable, a port, an object.
Please try the hints I gave you, and keep us informed.
Cheers,
raph
_________________________________________________________________________________
mozart-users mailing list [EMAIL PROTECTED]
http://www.mozart-oz.org/mailman/listinfo/mozart-users