-------- Forwarded Message --------
Subject: Re: [ARTIQ] controller management
Date: Fri, 19 Dec 2014 15:11:08 -0700
From: Robert Jördens <jord...@gmail.com>
To: Sébastien Bourdeauducq <s...@m-labs.hk>

Hello,

On Sat, Dec 13, 2014 at 2:33 AM, Sébastien Bourdeauducq <s...@m-labs.hk>
wrote:
> The manager shouldn't have to be restarted unless its code is modified.
> Crashes may still happen, but it is a relatively simple and contained
> piece of software, so they should be rare.

Ack. An init system...

>> How does the configuration file get to the manager?
>> Shouldn't it get that from the device db on the master?
>
> The manager would still need configuration data in order to reach the
> master.

The managers need only a URL for the master and their own name/id/hostname.

> My idea is to keep a list of controllers to run on each machine, with
> parameters such as the TCP port, the type of controller, and the device
> serial number. In fact, the command line to run to start a controller
> (and a controller can be used without a manager). In the device database
> on the master, we would simply have entries such as "electrodes_qc_q1_0
> -> pdq2 controller on 192.168.1.25:7899". When starting an experiment,
> the master creates RPC clients and uses them as devices in the experiments.

The managers' configuration files then contain redundant data.
I would just point the managers to the master where they can get their
(current, versioned) subset of the devicedb. They can cache that if
really needed.
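
As a sketch of what I mean (the field names and addresses are made up,
not an existing ARTIQ format): the master-side devicedb holds the
name-to-controller mapping, and the manager's own file shrinks to the
master URL plus its identity:

    # Hypothetical master-side devicedb entry: name -> controller.
    device_db = {
        "electrodes_qc_q1_0": {
            "type": "controller",
            "driver": "pdq2",
            "host": "192.168.1.25",
            "port": 7899,
        },
    }

    # Hypothetical manager-side configuration: just the master URL and
    # the manager's identity; the controller list is fetched from the
    # master's devicedb.
    manager_config = {
        "master": "192.168.1.1:3250",   # address made up
        "name": "rack2-frontend",       # name made up
    }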

>> How does it handle changes in the configuration file?
>
> I propose that controllers can be dynamically added and removed via a
> network API, similarly to experiment submission and cancellation on the
> master. The manager would save its configuration file itself. Manual
> modification should only be done when the manager is not running.

What I meant is how this will play with the versioning of the device
db (like the parameter db).
The (device type, device serial)->device name->controller url mapping
is closely related to the experiments and should live together with
them.
What should happen if you check out an older version of
experiments/devicedb/paramdb?

>>> When a device is unplugged, what happens is:
>>> 1) the associated controller exits (this becomes the main constraint for
>>> controller development: they must exit when their device fails in a
>>> non-recoverable manner)
>>
>> Yeah. That tends to be impossible within any asyncio if all you have
>> is a blocking API. Threaded or multiprocessing watchdog within each
>> controller or within the manager? And this might get even funnier if
>> you end up doing syscalls that can leave you in uninterruptible sleep.
>
> How is that impossible? An asyncio watchdog within a controller sounds
> feasible in many cases. I would also expect blocking syscalls that

Using a blocking API from within an event loop is broken code because
it blocks your event loop unpredictably.

> operate on the device to return with an error if the device is unplugged
> during the syscall.

Sometimes yes. Sometimes it takes a timeout outside your control to
expire before the call returns. Sometimes it just hangs.
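
To make the constraint concrete, a minimal sketch (not ARTIQ code) of
the thread-plus-timeout pattern: run the blocking vendor call in an
executor so the event loop stays responsive, and treat a timeout as a
non-recoverable device failure so the controller exits and the manager
can notice.

    import asyncio
    import os

    def read_device():
        # Placeholder for the blocking vendor call; it may hang if the
        # device is unplugged mid-syscall.
        raise NotImplementedError

    async def watched_read(timeout=2.0):
        loop = asyncio.get_event_loop()
        try:
            # Run the blocking call in a worker thread so the event
            # loop keeps running, and bound how long we wait for it.
            return await asyncio.wait_for(
                loop.run_in_executor(None, read_device), timeout)
        except asyncio.TimeoutError:
            # Non-recoverable: exit so the manager can restart us.
            # os._exit skips interpreter cleanup, so a thread stuck in
            # uninterruptible sleep cannot keep the process alive.
            os._exit(1)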

>> I can see people wanting to have the severity of certain controllers'
>> absences lowered to warnings. Some attenuators need not be
>> controllable all the time.
>
> What about replacing them with dummies in the hardware database?

Dummies as automatic fallback instead of suppressing RPC failures? I
don't see how that would make it easier.

>> Can the "default/idle" kernel use remote controllers at all?
>
> What do you call the "default/idle" kernel? We talked about a kernel to
> be written in the flash memory of the core device and run automatically
> should the master (or the network connection) fail or otherwise become
> unavailable.

We discussed and distinguished two kernels. The default kernel would
be in flash, not require the master, and start or run whenever there
is no master or the master fails to tell it what to do.
That default kernel would only change (get updated due to changes in
the parameter database, channel overrides, etc.) on request (or on
shutdown of the master).
The idle kernel would be there to ensure that the master's queue is
not empty. It changes when the parameter db does. It is
smarter/better/bigger than the default kernel. It spits out some basic
measurements (ion fluorescence, ion present) and uses all devices.

> It is much simpler if controller use from the core device goes through
> two layers of RPC: core device to master, then master to controller. This
> can be done with the current code without any further additions, and
> allows non-Ethernet protocols to be used between the core device and the
> master. In that case, the default/idle kernel cannot use controllers.

What I meant is that the default kernel must not fail if a device
disconnects or a controller hangs. There is also no reason for the
idle kernel to fail if some non-core device is disconnected.

>> A half-running experiment might be infinitely better than a non-running one.
>
> What do you mean? Should it automatically replace unreachable
> controllers with dummies, and log that those controllers were not present?

Yes. Where desired. And/or time out the master-to-controller RPC and
return None instead of raising an exception for these "optional"
controllers.
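
Roughly this, as a sketch only (none of it exists yet; the wrapper name
and logging are made up):

    import asyncio

    async def optional_call(coro_fn, *args, timeout=1.0, log=print):
        # For controllers flagged as optional: bound the RPC with a
        # timeout and turn failures into a logged warning plus None,
        # instead of an exception that kills the experiment.
        try:
            return await asyncio.wait_for(coro_fn(*args), timeout)
        except (asyncio.TimeoutError, OSError) as exc:
            log("optional controller unavailable: %s" % exc)
            return None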

>> Or you
>> have written some code in the office with simulated hardware and want
>> to jump back and forth between that environment and the real one in
>> the lab a few times just by doing "git commit, push, deploy".
>
> For this sort of development activity, what I'd propose is to make the
> development directory accessible to the master, possibly via NFS or SMB.
> Then you would submit an experiment using the file name on the master
> (with full path) and bypass the regular Git-based experiment database.
> This would support imports as long as you are not modifying the ARTIQ code.

That is not sufficient if you are also developing the hardware or
simulated-hardware drivers.
If you are developing a controller, changes to the experiment and the
controller are correlated.
Also, shared filesystems are not easy here at NIST, nor will they be
technically easy given the file locking Windows does.
Finally, since experiments require the git commit id (and a clean
working tree, otherwise the commit id is meaningless), you must/can
commit anyway.
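
(A quick sketch of the check that implies, using plain git commands:)

    import subprocess

    def current_commit(repo="."):
        # The commit id only identifies the code if the working tree
        # is clean, so refuse to record a dirty tree.
        dirty = subprocess.check_output(
            ["git", "status", "--porcelain"], cwd=repo)
        if dirty:
            raise RuntimeError("working tree is dirty; commit first")
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=repo).decode().strip()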

>> Or you
>> are developing a new data processing routine that requires iterating
>> through correlated changes in the runtime, the master, the experiment,
>> the ui and a controller.
>> And it seems that partial reloads (all controllers/just the
>> master/just the UI) could become another request.
>
> Controllers can be restarted seamlessly as long as nothing is connected
> to them (and this could be done via the manager network API, and then
> restarting all controllers is doable with a simple shell script), and
> the master can operate with the UI disconnected, so restarting it should
> not be an issue.

Having all this as one function/button -- including restarting the UI,
the master, and all controllers -- is very desirable.

> This of course assumes that the controller code is upgraded on all the
> involved machines. Upgrading it automatically on multiple machines is
> quite straightforward with shell scripting, ssh and git (and it seems an
> ssh server can be run on Windows).

I would bet triggering git pull through the manager API (without an
ssh server on the managers) is easier than git push to the manager.
Or even running the git protocol over the master API.

>> The other case is simplifying bringing up _everything_ if you come
>> into the lab in the morning and find that ITAC has applied a "critical
>> patch" and rebooted half of your machines for you (because you
>> forgot).
>
> Add the controller manager as a program to start at boot? If the master
> also reattempts failing experiments until they succeed (the
> "second-simplest thing to do" above), things should automatically resume
> shortly after the reboots.

Ack.

>> Also: If I want to run a little software PID controller between an ADC
>> on one machine and a DAC on another, can I run that concurrently and
>> independently of the "real experiment" that centers around the core
>> device?
>
> Yes. The controllers do not care where the connections are coming from -
> the master or elsewhere. Registering them with managers is also
> optional, you can run them directly on the command line.
> You can then bypass the master and run your code on your local machine
> (or even on the machine where the master is running), with a local
> hardware database, by using the "artiq_run.py" utility.

It seems conceptually appropriate to be able to manage these
concurrent non-core experiments in the master as well. They may want
the same scheduling API and automatic startup (like the idle kernel)
as the core experiments.

-- 
Robert Jordens.


_______________________________________________
ARTIQ mailing list
https://ssl.serverraum.org/lists/listinfo/artiq
