Hi Jens,

I am sorry that nobody replied to you for two weeks. The Mozart development team is quite small now. Moreover, we are no longer putting effort into maintaining the current distribution part of Mozart. In fact, Boris Mejias and I (UCL) are currently reimplementing it.

We are rebuilding the distribution layer of Mozart on a new language-independent distribution subsystem (DSS). The DSS was developed at SICS, and is a kind of factorization of the current Mozart distribution. We are also designing a new Fault module, with a much simpler interface. And hopefully fewer bugs, too ;-)

Now coming back to the question,

Jens Grabarske wrote:
> The general idea is that you have little computation servers running around who get their tasks by a central instance (the so-called Master). The master gets the tickets given to him by the clients, uses Connection.take to connect to them, gives them something to do (actually the name of a file with stuff to do) and then goes back to sleep again.
>
> Going back to sleep means: essentially the master uses Time.repeat to wake up at certain intervals to see whether he is needed (I built a poor man's cron).

Side question: Is there a reason for the master to connect to its clients, instead of the clients connecting to the master? The latter looks simpler to me. You don't need any sleep/wakeup mechanism.

Just to give you the idea: let the master create a port and publish a ticket for it. Each client takes the ticket to get a reference to the master's port, and sends a message asking for something to do. The master simply reads the messages on its port and sends tasks back to the clients. Each message may carry an unbound variable, or a port, for the master to reply on. When a client completes its task, it sends another message to the master to get a new one.
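
Here is a rough, untested sketch of that scheme. It assumes Connection.offerUnlimited for creating a ticket that several clients can take. The message label getTask, the task(file:...) record, the file name and NextTask are made up for illustration, and the two parts run on different sites:

```oz
%% --- Master site ---
declare Requests MasterPort Ticket NextTask in
MasterPort = {NewPort Requests}      % Requests is the stream of incoming messages
%% A ticket that any number of clients can take
Ticket = {Connection.offerUnlimited MasterPort}
{System.showInfo Ticket}             % pass this ticket to the clients somehow

%% Hypothetical task source: always hands out the same file name
fun {NextTask} task(file:'todo.txt') end

thread
   {ForAll Requests
    proc {$ Msg}
       case Msg
       of getTask(Reply) then Reply = {NextTask}  % bind the client's variable
       else skip                                   % ignore unknown messages
       end
    end}
end

%% --- Client site (Ticket is the ticket received from the master) ---
declare MasterPort Work in
MasterPort = {Connection.take Ticket}
proc {Work}
   Task in
   {Send MasterPort getTask(Task)}
   case Task of task(file:File) then   % blocks until the master has replied
      {System.showInfo "working on "#File}
   end
   {Work}                              % ask for the next task
end
thread {Work} end
```

With this design the master never takes a client's ticket, so it only holds references that clients hand to it inside messages.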

> Now this all works like a charm - until I kill off one of the clients. Instead of ignoring this, the Master freezes - somehow the procedure the Time.repeat is supposed to trigger doesn't work anymore. Meaning: nothing happens, the Master stops working.

Can you stop the Master with Control-C? If you can't, it means Mozart has really crashed.

> I tried something like:
>
>    _ = {Fault.defaultDisable}
>
> but this doesn't seem to solve the problem.

This will never solve the problem. Fault.defaultDisable makes a thread block when a distributed operation cannot be done. Try the following instead; at least you should see an exception if things are broken:

   {Fault.defaultEnable [tempFail permFail] _}

> So, the big question is:
>
> 1. Obviously he doesn't like it that he took the ticket of a process that got busted. He should be tolerant to this, actually, he shouldn't take notice at all - how can one accomplish that? (for future use it will be nice to see whether there are problems with the clients, but for now, he can just ignore the state of them).

Mmmm, Connection.take should raise an exception in such a case.
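
If it does, the master can catch it and carry on. A minimal, untested sketch; TryTake and the some/none and takeFailed records are hypothetical names, and I am not assuming anything about the shape of the exception:

```oz
declare
%% Hypothetical helper: returns some(Entity) on success, none on failure
fun {TryTake Ticket}
   try
      some({Connection.take Ticket})
   catch E then
      {System.show takeFailed(E)}   % log the exception and keep going
      none
   end
end
```

The master would then simply skip a client whenever {TryTake Ticket} returns none.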

> 2. The connection established with Connection.take can't be accessed anymore after he gave the client the task list. Why didn't the garbage collector get rid of it?

This means that both sites are still sharing some language entity. It can be anything: a variable, a port, an object.


Please try the hints I gave you, and keep us informed.

Cheers,
raph

_________________________________________________________________________________
mozart-users mailing list                               [EMAIL PROTECTED]
http://www.mozart-oz.org/mailman/listinfo/mozart-users
