Hi All,

We have replaced the XML descriptions we used to describe applications/hosts with a new design (App Catalog) [1], so GFac now has to change to adapt to the new App Catalog. Before jumping into the integration, I think it's a good time to review the GFac architecture and modify it if necessary.

In the current GFac architecture we have plugins which can be executed in a chain. This chained configuration is defined in an XML file [2] and is selected based on the computing resource type (as defined in the old XML descriptions) [3]. Basically we can configure one execution chain for a GSISSH type resource, another chain for an SSH type resource, and another one for some other type (like EC2). Currently we differentiate hosts based on their authentication mechanism.

This architecture leads to the following limitations.

1. There could be scenarios where we need a different execution pattern for the same host type. With this model we cannot have two execution chains for the same host type.

2. There could be cases where we want to run a particular handler only for a given machine within the same host type (e.g. for stampede, run Handler1 at the end, but not for any other machine).

3. Differentiating hosts based on authentication doesn't look right, because we have a few machines that authenticate with different mechanisms while everything else is the same. This problem has been solved in the App Catalog design, which lists the available authentication mechanisms for a given compute resource.

4. Currently the execution chain is picked up front based on the host and then executed, but we do not have a fallback execution chain or any fault tolerance at the experiment level. Do we have to do fault tolerance at this level, or should we just mark the experiment as failed and let the Orchestrator send another job request to GFac with a different computing resource or a different authentication mechanism (if authentication failed)?

   Note: The only fault tolerance we have implemented covers the case where a particular GFac instance stops responding, which is very rare unless we have a very heavy load. In this case another GFac instance can pick up the execution chain from the checkpointed location and execute the rest of the chain. This works because the chain-picking logic is statically configured in an XML file, and the plugins themselves can implement a recover method which is invoked by the GFac core during the recovery process.

5. Currently there is a way to configure a chain based on the gateway name, but this is a simple configuration: one gateway name can have only one configuration, and it will be the same for any host. (We can improve this by embedding the per-host configuration in an outer config that provides the gateway name, so that a given gateway name can have a given set of execution chains.) I think GFac execution should be customizable for each gateway without interfering with other gateways.

6. Currently the plugin implementations (EC2, GSISSH, SSH, etc.) are independent of each other and they all depend on gfac-core. If there are use cases where a mix of these plugins has to run, we can configure them in a single execution chain, and as long as all the plugin artifacts are in the GFac classpath things should work out of the box. But in such a scenario, how do we configure a chain? It will be hard to configure based on a particular host.

My suggestion is to introduce a more advanced XML configuration language with a defined precedence for picking a chain. Of course there are n! ways to order n plugins, but in practice the number of chains will be very low. Since we have a set of properties for selecting hosts (e.g. gateway name, authentication type, host address, gateway user name, or some other property like cpuCount > 10 on host stampede), we can come up with a proper precedence order based on how specific the configuration is, and we should be able to share an execution chain between multiple gateways, etc. (We can discuss a nice way to configure the XML which covers most of the limitations explained above.)
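
To make the precedence idea a bit more concrete, here is a rough sketch in Java of how "most specific configuration wins" could work. Everything in it is hypothetical and only for discussion: the class and record names (ChainSelector, Request, ChainRule), the property keys (authType, host) and the handler names other than Handler1 are made up and are not part of the current GFac code or configuration. It simply treats a rule with more matching conditions as more specific.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch only: none of these types exist in GFac today.
public class ChainSelector {

    /** Properties of an incoming request that a rule may match on (assumed keys). */
    public record Request(Map<String, String> properties) {}

    /** A chain plus the conditions under which it applies. */
    public record ChainRule(Map<String, String> conditions, List<String> handlers) {

        /** True when every condition of this rule equals the request's property. */
        boolean matches(Request request) {
            return conditions.entrySet().stream()
                    .allMatch(e -> e.getValue().equals(request.properties().get(e.getKey())));
        }

        /** Assumption: more conditions means a more specific rule. */
        int specificity() {
            return conditions.size();
        }
    }

    /** Picks the handler chain of the most specific rule that matches the request. */
    public static Optional<List<String>> select(List<ChainRule> rules, Request request) {
        return rules.stream()
                .filter(rule -> rule.matches(request))
                .max(Comparator.comparingInt(ChainRule::specificity))
                .map(ChainRule::handlers);
    }

    public static void main(String[] args) {
        List<ChainRule> rules = List.of(
                // Generic chain for any GSISSH resource.
                new ChainRule(Map.of("authType", "GSISSH"),
                        List.of("InputStager", "JobSubmitter")),
                // More specific chain: same host type, but stampede also runs Handler1 last.
                new ChainRule(Map.of("authType", "GSISSH", "host", "stampede"),
                        List.of("InputStager", "JobSubmitter", "Handler1")));

        Request stampede = new Request(Map.of("authType", "GSISSH", "host", "stampede"));
        Request other = new Request(Map.of("authType", "GSISSH", "host", "otherhost"));

        System.out.println(select(rules, stampede)); // chain that ends with Handler1
        System.out.println(select(rules, other));    // generic GSISSH chain
    }
}
```

This covers limitations 1 and 2 (two chains for the same host type, one handler only for one machine). Counting conditions is just the simplest possible notion of specificity; a real design would need an explicit order between properties (gateway name vs. host address, for example) and a rule for ties, and the rules would come from the XML or registry rather than being hard-coded.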

To address problem #4: if we decide GFac has to act smart and avoid failing an execution, we can come up with a fault chain with a precedence. But since the App Catalog already has a way to define the precedence, the Orchestrator can fall back and submit another job to GFac, so GFac can handle each request in a stateless way (everything from the App Catalog is finalized when the request comes to GFac; GFac just has to find the right set of handlers to use for this request).

We can also change the static nature of the configuration by storing it in the registry, so that while GFac is up and running the configuration can be modified by the admin (once we have a proper Admin UI). If a new requirement comes up for a particular gateway, we can implement a plugin and configure that plugin.

If you have any ideas, that would be great.

Regards
Lahiru

[1] https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Application+Catalog
[2] https://github.com/apache/airavata/blob/master/modules/configuration/server/src/main/resources/gfac-config.xml
[3] https://github.com/apache/airavata/tree/master/modules/commons/gfac-schema/src/main/resources/schemas

--
Research Assistant
Science Gateways Group
Indiana University
