Wow, this thread is starting to get long, but I think a lot of good details are coming to light and being recorded for posterity in the mailing list.  Now if we could just compile it all into one resource.  :D

If I put my comments inline they will be hard to read, so I will try to query/respond by section number instead.

1.)  Krishna, I think I may be able to help with this.  I have run into similar issues with logging in secondary appdomains.  I don't have the code sitting in front of me, so excuse my ignorance, but what are you using for logging currently?  No sweat about not having time for Alchemi; we all have day jobs and understand.  I'm getting ready to start at Microsoft out in Redmond at the end of March, so I will be fairly busy for the next couple of months.  I'll have to check out the work you are doing with Grid Broker; it sounds interesting.

2.)  So Krishna, what is the behavior if a Manager goes belly up or its communication with the network is severed?  Do all worker nodes of that Manager leave the appdomain hanging until the Executor is shut down?  If an Executor is connected to several Managers, and the GApplication being run on that worker node on a Manager's behalf is large (which may very well happen with the type of applications that are suitable for grid enablement), this could become a prominent issue.

What I was proposing is as follows.  The ServiceManager/ExecutorController is what a Manager communicates with on a worker node.  This Controller will fire up and manage appdomains based on the number of Managers, and then start an executor in each appdomain.  These 'Executor' objects derive from MarshalByRefObject and have a configurable lease lifetime.  The Controller acts as a bridge between the 'Manager' and the 'Executor', routing all calls to the 'Executor'.  Every time communication happens between the two, the lease lifetime is extended.  If a Manager drops offline or communication is cut for whatever reason, the lease lifetime will expire for that 'Executor', the Controller will be notified through a delegate, and it will then clean up the 'abandoned' appdomain.  This cleanup can also be initiated by the Manager when it is done executing its work.  As I see it, this is just another layer of abstraction between the Manager and the Executor that allows for a little more robustness.
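
To make the lease idea above concrete, here is a minimal sketch of the bookkeeping. It is in Python purely for illustration; it is not Alchemi code and does not use .NET Remoting's own lease machinery. The names `ExecutorController`, `route_call`, `release`, `sweep`, and `on_unload` are all hypothetical, and the callback stands in for unloading the abandoned appdomain.

```python
import time
from typing import Callable, Dict, List


class ExecutorController:
    """Hypothetical bridge between Managers and their per-Manager executors."""

    def __init__(self, ttl_seconds: float,
                 on_unload: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._on_unload = on_unload   # stands in for unloading the appdomain
        self._clock = clock           # injectable so the sketch can be tested
        self._expires_at: Dict[str, float] = {}

    def route_call(self, manager_id: str) -> None:
        """Route a Manager call; any traffic starts or extends that lease."""
        self._expires_at[manager_id] = self._clock() + self._ttl

    def release(self, manager_id: str) -> None:
        """Manager-initiated cleanup once its work is done."""
        if self._expires_at.pop(manager_id, None) is not None:
            self._on_unload(manager_id)

    def sweep(self) -> List[str]:
        """Unload every executor whose lease has lapsed; return their ids."""
        now = self._clock()
        abandoned = [m for m, t in self._expires_at.items() if t <= now]
        for manager_id in abandoned:
            del self._expires_at[manager_id]
            self._on_unload(manager_id)
        return abandoned

    def active(self) -> List[str]:
        """Ids of Managers that still hold a live lease."""
        return sorted(self._expires_at)
```

In a real Controller, `sweep` would presumably run on a timer, so a Manager that silently disappears is cleaned up within roughly one lease period without anyone having to tell the worker node.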

I think I helped Tibor out with the threading issues he was facing.  Tibor, did that work for you?

3.)  Krishna, I was thinking of a somewhat longer caching lifetime for dlls.  Let's say one day Manager1 needs AppA to be executed on the grid.  You have to push the dlls of AppA down to each worker node.  You finish your work for that day, the Manager notifies the worker nodes that it no longer needs their services, and they clean up any executable payload that was pushed to them.  The next day Manager2 needs AppA to be executed on the grid and follows exactly the same steps as on the first day.  Now shorten the interval to 12 hours, 1 hour, 1 minute.  A lot of redundant bytes could be flying around the grid's topology.

If we could cache the apps being pushed around on the worker nodes, and have a Manager check whether an app is already on a worker node before pushing it, the grid would be less network intensive.  Basically, every Manager pushes an app to the Controller on a worker node only if it doesn't already reside there.  Then, when the 'Executor' is loading the app for the 'Manager', the files are pulled from this central repository (in effect multiple folders, one per app) and copied to a shadow directory that the 'Executor's appdomain path points to.  This allows multiple versions of the same app to run side by side in different appdomains.
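
A rough sketch of the check-before-push idea follows, again in Python only as an illustration and not as Alchemi code. It assumes apps are identified by a hash of their file contents, which is my stand-in for whatever identity scheme is eventually chosen (signed dlls were suggested elsewhere in this thread). `WorkerRepository`, `payload_key`, and `push_if_missing` are hypothetical names, and an in-memory dict stands in for the per-app folders.

```python
import hashlib
from typing import Dict


def payload_key(app_name: str, files: Dict[str, bytes]) -> str:
    """Content-derived key: same app name with different bytes gives a new key."""
    digest = hashlib.sha256()
    digest.update(app_name.encode("utf-8") + b"\0")
    for name in sorted(files):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


class WorkerRepository:
    """Stand-in for the central repository on a worker node (one entry per app)."""

    def __init__(self) -> None:
        self._apps: Dict[str, Dict[str, bytes]] = {}

    def has(self, key: str) -> bool:
        return key in self._apps

    def store(self, key: str, files: Dict[str, bytes]) -> None:
        self._apps[key] = dict(files)

    def shadow_copy(self, key: str) -> Dict[str, bytes]:
        """Copy handed to an executor, so the cached original stays untouched."""
        return dict(self._apps[key])


def push_if_missing(repo: WorkerRepository, app_name: str,
                    files: Dict[str, bytes]) -> int:
    """Push the app only if the worker lacks it; return the bytes actually sent."""
    key = payload_key(app_name, files)
    if repo.has(key):
        return 0
    repo.store(key, files)
    return sum(len(data) for data in files.values())
```

Because the key depends on the file contents, two versions of AppA end up as separate repository entries, which is what lets them run side by side in different appdomains.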

4.) Security.  Krishna, I agree with everything you said.  You would want exactly that level of control over security.

5.) Krishna, if I am reading between the lines correctly, it would almost seem that you are talking about some sort of P2P overlay topology for the grid.  If this is correct, I think it is a fantastic idea.  I have been involved in a couple of P2P apps and I would be glad to lend a hand with the implementation for Alchemi.  This would allow for clustering of resources.  It would also allow for pushing through firewalls and all manner of network nastiness that can happen.  But, as I alluded to above, I'll be busy until probably mid-May getting settled in with my new employer.  After that I would be happy to contribute to the project.

Have a great day,

John


On 2/24/06, Krishna <[EMAIL PROTECTED]> wrote:
Hi John,

What Tibor said is basically correct. I have added my comments inline as
well.

Krishna.

>
>    1. In Alchemi's case the controller of an AppDomain is the Manager.
>       What happens if a Manager goes down or there is a network
>       interruption of some sort? Is the Executor in its appdomain aware
>       enough that it knows that it should reclaim its resources and
>       unload itself due to whatever failure? The way in which I handle
>       this is the controller actually resides in the AppDomain so you
>       can equate that to the Executor. If the Manager keeps pushing
>       work to the executor then the Executor's lease lifetime would not
>       time out and it would not unload.
>
> */[Tibor Biro] In Alchemi the AppDomain is created on the Executor.
> One AppDomain is created per application. The AppDomain is kept alive
> until the Manager tells the Executor that the application finished. If
> multiple threads are running at the same time on the same Executor for
> the same application then the same AppDomain is used. Currently we are
> having problems if the thread inside the sandbox hangs for whatever
> reason. If the Executor just dies then the Manager re-schedules the
> thread to another Executor but if the thread hangs then nothing happens./*
>
[Krishna: ] This is correct. I tried some time back to get some logging
out of the secondary appdomains, but have not been able to do so. This
was a while back, and I haven't been able to concentrate on Alchemi for
any significant period of time in the past 2 months, because my focus has
been diverted towards our broker software (http://www.gridbus.org/broker)
(http://sourceforge.net/projects/gridbusbroker).

>    2. In Alchemi, and correct me if I'm wrong here, the Executor can
>       be viewed as the controller of work units, whatever those may
>       be, as you alluded to above. The executor can execute many
>       different types of work units within a specific app domain. So
>       in effect you have an appdomain per manager work unit on each
>       node. This is a bit different than what I am doing. I 'group'
>       like work units, i.e. reports, into one appdomain and service
>       any requests for those work units from that appdomain. A 'group'
>       in Alchemi's case is a particular instance of a Manager
>       enlisting a particular node to do work on its behalf. Is this a
>       correct assumption? The approach I am using would still work in
>       this instance. Each executor would have a lease lifetime
>       associated with it and would unload itself *after* that
>       lease expires. This would relieve the Manager of having to
>       control the lifetime of the appdomain on the co-operating nodes.
>
> */[Tibor Biro] This behavior was just changed so I'll describe the
> version in CVS. Each Executor creates an AppDomain for each
> application, identified by the AppID. If a work unit comes from the
> same app ID and there is an existing AppDomain for it then that one is
> used instead of creating a new one. So all work units from the same
> app ID are grouped in the same AppDomain. By default the Executor only
> accepts one work unit at a time but it can be configured to accept
> multiple work units (this is the feature I am working on now) in which
> case it is possible to have multiple work units running at once in the
> same AppDomain./*
>
> */ I think it would be useful to have the Executor's AppDomain "time
> out" somehow. Currently the Manager has to terminate the AppDomain so
> if the Manager fails to do that the AppDomain is never unloaded.
> Another thing that sometimes happens is that an AppDomain cannot be
> unloaded, probably because of the hanging threads./*
>
[Krishna :] Tibor, I am not sure what behaviour has changed (perhaps you
are talking about multiple threads on an executor). The behaviour Tibor
has outlined, (with the exception of multiple threads), is what was
intended from the beginning. So, an AppDomain is expected to exist (or
is created otherwise) for each GApplication. All the GThreads that are
part of the same GApplication run in the same Appdomain on the Executor.
The Manager, of course, does not directly control AppDomains on the
Executor. It tells the Executor when the GApp is finished, so
appropriate clean up can be performed.

>    3. Caching of work dlls. Currently Alchemi pushes the required dlls
>       down to a slave system whenever it instigates work on a node.
>       Those same dlls are purged from the system once work is complete
>       and the manager no longer needs the node to do work? If this is
>       the case have you thought about some sort of caching scheme so
>       that the managers and nodes would not have to re-push the dlls if
>       they have executed this type of work before?
>
> */[Tibor Biro] Caching the DLLs would be useful. We'll need the dlls
> to be signed though so we can properly identify them./*
>
[Krishna: ] This may be an improper way of putting it but: there is some
kind of caching going on now. I mean, the dll files which are part of
the Application manifest, get copied to the Executor only once. So, in
that sense, all GThreads which use those dlls would be able to do so,
without copying the set of dlls specified in the manifest multiple
times. So this can be considered as some caching, if you like, but if
additional capabilities are needed, we should surely look into providing
them.

>
>    4. Security. Right now Alchemi is pretty wide open when it comes to
>       security. This is actually a very large issue with a lot of
>       things to take into consideration. This would be easier to put a
>       box around if we could say exactly what each executor would be
>       doing and how it would be getting its data to operate on but
>       each executor could do a multitude of things. Maybe if we state,
>       from an executor, what a work unit would need, we could lock
>       out other things. We could then, from a cooperating node, say
>       that we only want to allow access to particular resources and
>       even give the opportunity to trust particular managers fully. I
>       don't know, I am just throwing out ideas here.
>
> */[Tibor Biro] I agree that security is currently an issue. I see this
> controlled mainly from the Executor at this point. In the future a
> centralized control would be nice to have. In the end what I would
> like to see is a way to specify the rights based on several parameters
> such as where the dll is coming from, who the user is and whether the
> dll is signed or not. I am thinking about something like the .NET Code
> Access Security Policies, maybe the built-in CAS securities can be used./*
>
> Okay, that should be enough questions to help me get my mind wrapped
> around it a little tighter and see how I might contribute what I have
> or modify what I have to help out.
>
[Krishna: ] Yes, I agree. I think .Net CAS would be the way to go. In
fact, the appDomain creation code in the Executor was created with the
intention that it would be used in combination with CAS. There are even
some code comments in there to that effect (I think). Of course, this
needs to be expanded, so that additional policies, for example per user
/ per manager / per dll / per machine could be added / managed, on the
Executor, and perhaps even the Manager. I would like to have the
Executor bit done first though, so things are basically designed from
the ground-up for a de-centralised security system, and then extended so
it can be used in a centralised way (via the Manager), if needed for
convenience. In effect, the Executor security settings should override
those set on the Manager (or rather the tighter security of the two
should be applied). This will also fall in line nicely with the concept
of autonomous resources being part of the Grid, so that if the Executor
and Manager are owned / operated by different individuals /
organisations the owner/administrator will still have full control of
his/her part of the "Grid".
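
The "tighter of the two" rule above could be sketched as a set intersection. This is a Python illustration only; it is not the .NET CAS API, and the function name and permission names are made up for the example.

```python
from typing import FrozenSet, Iterable, Optional


def effective_permissions(executor_allows: Iterable[str],
                          manager_allows: Optional[Iterable[str]] = None
                          ) -> FrozenSet[str]:
    """Grant only what BOTH policies allow, so the tighter policy wins.

    The Executor's own policy is always applied; a Manager policy, if one
    is supplied, can narrow the result further but never widen it.
    """
    granted = frozenset(executor_allows)
    if manager_allows is not None:
        granted &= frozenset(manager_allows)
    return granted
```

With this rule a Manager can never grant something the Executor's owner has not allowed, which keeps each owner in control of his/her own machine.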

I also just wanted to remind everyone that Alchemi was (and in my
opinion should continue to be) developed with the idea of scaling to a
large-ish Grid system with heterogeneous, autonomous, and perhaps even
geographically distributed resources, which can also work in a
cluster-like environment. I agree that the .Net remoting architecture
lends itself more to a LAN environment than a WAN-like one; however,
the idea is to make it so that it is not restricted to a cluster / LAN
environment.

> Thanks,
>
> John
>



--
Life should NOT be a journey to the grave with the intention of arriving safely in an attractive and well preserved body, but rather to skid in sideways, paddle in one hand, beer in the other, body thoroughly used up, totally worn out and screaming "WOO HOO what a ride!"
