Are you sure you don't have a threading issue? Do you ever get this problem on a hyperthreaded or multi-CPU machine? The reason I ask is that I have seen the hanging-thread issue before. In my case, threads would appear to hang on a single-CPU machine without hyperthreading. Say I fired up 10 threads to handle some work units: one would run while the other 9 would seemingly hang. What was going on is that even though I had marked the Main entry point with the MTAThread attribute, that applied only to COM-related apartment threading, so on the single CPU the threads were running sequentially. If I had 1000 items to process, 991 of them would get processed and the last 9 would never return. I could bring up the Threads view in the debugger and see them, but they would always be waiting on their MREs.
So, as a measure of desperation, I set System.Threading.Thread.CurrentThread.ApartmentState = ApartmentState.MTA in the Main method, and everything started working. You might want to check this if you are using more than one thread in the AppDomain. Hope this helps.
John
Hi John,
Coincidentally, I am just now working, or rather struggling, in the AppDomain area. For some reason some threads just get lost in the sandboxing application domain.
I'll try to reply to your questions, and Krishna will correct me if I'm wrong. Please see my comments inline.
Tibor
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of John Sheppard
Sent: Friday, February 24, 2006 12:10 PM
To: Krishna
Cc: Jonathan Mitchem; [email protected]; [email protected]
Subject: [Alchemi-users] Re: [Alchemi-developers] Grid-based Malware
Krishna,
I see a couple of differences in the approaches that we are each using. Let me ask some clarifying questions so I can better answer yours.
- In Alchemi's case the controller of an AppDomain is the Manager. What happens if a Manager goes down or there is a network interruption of some sort? Is the Executor in its AppDomain aware enough to know that it should reclaim its resources and unload itself after such a failure? The way I handle this is that the controller actually resides in the AppDomain, so you can equate that to the Executor. As long as the Manager keeps pushing work to the Executor, the Executor's lease lifetime does not time out and it does not unload.
[Tibor Biro] In Alchemi the AppDomain is created on the Executor. One AppDomain is created per application. The AppDomain is kept alive until the Manager tells the Executor that the application finished. If multiple threads are running at the same time on the same Executor for the same application then the same AppDomain is used. Currently we are having problems if the thread inside the sandbox hangs for whatever reason. If the Executor just dies then the Manager re-schedules the thread to another Executor but if the thread hangs then nothing happens.
- In Alchemi, and correct me if I'm wrong here, the Executor can be viewed as the controller of work units, whatever those may be, as you alluded to above. The Executor can execute many different types of work units within a specific AppDomain. So in effect you have an AppDomain per Manager work unit on each node. This is a bit different from what I am doing. I 'group' like work units, e.g. reports, into one AppDomain and service any requests for those work units from that AppDomain. A 'group' in Alchemi's case is a particular instance of a Manager enlisting a particular node to do work on its behalf. Is this a correct assumption? The approach I am using would still work in this instance. Executors would have a lease lifetime associated with them and would unload themselves *after* that lease expires. This would relieve the Manager of having to control the lifetime of the AppDomain on the co-operating nodes.
[Tibor Biro] This behavior was just changed so I'll describe the version in CVS. Each Executor creates an AppDomain for each application, identified by the AppID. If a work unit comes from the same app ID and there is an existing AppDomain for it then that one is used instead of creating a new one. So all work units from the same app ID are grouped in the same AppDomain. By default the Executor only accepts one work unit at a time but it can be configured to accept multiple work units (this is the feature I am working on now) in which case it is possible to have multiple work units running at once in the same AppDomain.
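The grouping described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Alchemi code and not the .NET AppDomain API: `AppDomainStub`, `Executor`, `accept`, `finish`, and `application_finished` are hypothetical names, and the single `max_concurrent` limit is an assumption about how "one work unit at a time" might be configured.

```python
class AppDomainStub:
    """Hypothetical stand-in for a per-application sandbox (not the .NET API)."""

    def __init__(self, app_id):
        self.app_id = app_id
        self.running = 0  # work units currently executing in this sandbox


class Executor:
    """Illustrative only: one sandbox per app ID, reused for later work units."""

    def __init__(self, max_concurrent=1):
        # Assumption: a single limit models "one work unit at a time by default".
        self.max_concurrent = max_concurrent
        self.domains = {}  # app_id -> AppDomainStub

    def running_total(self):
        return sum(d.running for d in self.domains.values())

    def accept(self, app_id):
        """Return the sandbox for app_id, or None if the Executor is busy."""
        if self.running_total() >= self.max_concurrent:
            return None
        domain = self.domains.get(app_id)
        if domain is None:
            # First work unit for this application: create its sandbox.
            domain = AppDomainStub(app_id)
            self.domains[app_id] = domain
        domain.running += 1
        return domain

    def finish(self, app_id):
        """A work unit completed; the sandbox stays alive for later units."""
        self.domains[app_id].running -= 1

    def application_finished(self, app_id):
        """The Manager reports the application is done: drop the sandbox."""
        self.domains.pop(app_id, None)
```

With `max_concurrent=2`, two work units from the same app ID share one sandbox, which is the multi-work-unit case mentioned above.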
I think it would be useful to have the Executor's AppDomain "time out" somehow. Currently the Manager has to terminate the AppDomain so if the Manager fails to do that the AppDomain is never unloaded. Another thing that sometimes happens is that an AppDomain cannot be unloaded, probably because of the hanging threads.
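One way such a time-out could work is a lease that is renewed whenever work arrives and checked periodically by the Executor itself. The sketch below is a minimal Python illustration under stated assumptions: `Lease`, `sweep`, and the `unload` callback are hypothetical names, and it does not address the case where an unload fails because a thread is still hanging.

```python
import time


class Lease:
    """Hypothetical lease: renewed on new work, expired after `ttl` idle seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.expires_at = clock() + ttl

    def renew(self):
        """Call whenever the Manager pushes a work unit for this application."""
        self.expires_at = self.clock() + self.ttl

    def expired(self):
        return self.clock() >= self.expires_at


def sweep(leases, unload):
    """Unload every sandbox whose lease has expired; return the app IDs unloaded."""
    expired = [app_id for app_id, lease in leases.items() if lease.expired()]
    for app_id in expired:
        del leases[app_id]
        unload(app_id)  # assumption: caller supplies the actual unload action
    return expired
```

Because expiry is decided locally, a Manager that crashes or loses its network connection simply stops renewing, and the sandbox goes away on the next sweep.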
- Caching of work DLLs. Currently Alchemi pushes the required DLLs down to a slave system whenever it instigates work on a node. Are those same DLLs purged from the system once work is complete and the Manager no longer needs the node to do work? If so, have you thought about some sort of caching scheme so that the Managers and nodes would not have to re-push the DLLs if they have executed this type of work before?
[Tibor Biro] Caching the DLLs would be useful. We'll need the DLLs to be signed, though, so we can properly identify them.
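A minimal sketch of such a cache, in Python for illustration only: it assumes files are identified by the SHA-256 digest of their contents, which is a swapped-in simplification and not the signing scheme suggested above. A content hash tells you the bytes are identical, but unlike a signature it says nothing about who published them. `AssemblyCache` and its methods are hypothetical names.

```python
import hashlib


class AssemblyCache:
    """Hypothetical Executor-side cache keyed by the SHA-256 of the file contents."""

    def __init__(self):
        self._store = {}  # digest -> file bytes

    @staticmethod
    def digest(data):
        return hashlib.sha256(data).hexdigest()

    def missing(self, digests):
        """Digests the Manager still has to push because they are not cached."""
        return [d for d in digests if d not in self._store]

    def put(self, data):
        """Store pushed bytes under their own digest and return that digest."""
        key = self.digest(data)
        self._store[key] = data
        return key

    def get(self, digest):
        return self._store[digest]
```

The Manager would first send only the digests, then push just the files reported by `missing`.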
- Security. Right now Alchemi is pretty wide open when it comes to security. This is actually a very large issue with a lot of things to take into consideration. It would be easier to put a box around if we could say exactly what each Executor would be doing and how it would be getting its data to operate on, but each Executor could do a multitude of things. Maybe if the Executor declared what a work unit needs, we could lock out everything else. We could then, from a co-operating node, say that we only want to allow access to particular resources, and even give the opportunity to trust particular Managers fully. I don't know, I am just throwing out ideas here.
[Tibor Biro] I agree that security is currently an issue. I see this controlled mainly from the Executor at this point. In the future a centralized control would be nice to have. In the end what I would like to see is a way to specify the rights based on several parameters, such as where the DLL is coming from, who the user is, and whether the DLL is signed or not. I am thinking about something like the .NET Code Access Security policies; maybe the built-in CAS policies can be used.
Okay, that should be enough questions to help me get my mind wrapped around it a little tighter and see how I might contribute what I have or modify what I have to help out.
Thanks,
John
--
Life should NOT be a journey to the grave with the intention of arriving safely in an attractive and well preserved body, but rather to skid in sideways, paddle in one hand, beer in the other, body thoroughly used up, totally worn out and screaming "WOO HOO what a ride!"
