Hi dev,

I have pushed the prototype project with Apache Helix to the Airavata Sandbox 
GitHub repository. This project creates a simple task execution workflow (DAG) 
with 4 tasks and runs it on 3 workers (the cluster contains 1 controller node, 
3 worker nodes, and 1 manager node). Here’s the link: 
https://github.com/apache/airavata-sandbox/tree/master/helix-playground

After discussing with Suresh, Supun, and Apoorv, we have agreed to proceed with 
using Helix to perform task execution for Airavata. I will now work on the 
architectural changes needed to accommodate Helix and will update this list 
with proposed designs. In parallel, I shall also start work on the develop 
branch to use Helix.

Thanks and Regards,
Gourav Shenoy

From: "Shenoy, Gourav Ganesh" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, June 21, 2017 at 1:24 PM
To: "[email protected]" <[email protected]>
Subject: Apache Helix as a Task Execution Framework for Airavata

Hi dev,

Apache Helix is a generic cluster management framework that allows one to 
build highly scalable and fault-tolerant distributed systems. It provides a 
range of functionalities, including:

·         Automatic assignment of resources (task executors) and their 
partitions (the parallelism of a resource) to nodes in the cluster.

·         Detecting node failures in the cluster, and taking appropriate 
actions to recover those nodes.

·         Cluster management – adding nodes and resources to the cluster 
dynamically, and load balancing.

·         Ability to define an IDEAL STATE for a node, and to define STATE 
TRANSITIONS for when a node’s state deviates from the IDEAL one.

Apart from these, Helix also provides out-of-the-box APIs to perform 
Distributed Task Execution. Some of the concepts Helix uses are ideal for our 
Airavata task execution use-case. These concepts include:

·         Tasks – the actual runnable logic executors (e.g. job submission, 
data staging, etc.). A task returns a TaskResult object which contains its 
state once completed: COMPLETED, FAILED, or FATAL_FAILED. The difference 
between FAILED and FATAL_FAILED is that FAILED tasks are re-run by Helix (a 
retry threshold can be set), whereas FATAL_FAILED tasks are not. A sketch of a 
task and a workflow appears after this list.

·         Jobs – a combination of tasks without dependencies; i.e. if a job 
has more than one task, the tasks run in parallel across workers.

·         Workflows – a combination of jobs arranged in a DAG. A ONE-TIME 
workflow ends once all of its jobs are completed; a RECURRING workflow is 
scheduled to run periodically.

·         Job Queues – another type of workflow that never ends on its own; it 
keeps accepting new incoming jobs and ends only when the user terminates it.

·         Helix also allows users to share data (key-value pairs) across 
Tasks/Jobs/Workflows. Content stored at the workflow layer can be shared by the 
different jobs belonging to that workflow; similarly, content persisted at the 
job layer can be shared by the different tasks nested in that job.

·         Helix provides APIs to POLL a JOB or WORKFLOW until it reaches a 
particular state.
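
To make these task-framework concepts concrete, here is a minimal sketch in 
Java (assuming the Helix task APIs under org.apache.helix.task; the task, job, 
and workflow names are made up for illustration) of a Task that reports a 
TaskResult, and of a small two-job DAG started through a TaskDriver and polled 
for completion:

import java.util.Collections;

import org.apache.helix.HelixManager;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.Task;
import org.apache.helix.task.TaskCallbackContext;
import org.apache.helix.task.TaskConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.TaskState;
import org.apache.helix.task.Workflow;

// A mock task: run() does the actual work and reports the final state as a TaskResult.
public class MockJobSubmissionTask implements Task {

    public MockJobSubmissionTask(TaskCallbackContext context) {
        // the callback context carries the TaskConfig/JobConfig for this task instance
    }

    @Override
    public TaskResult run() {
        try {
            // ... job-submission logic would go here ...
            return new TaskResult(TaskResult.Status.COMPLETED, "job submitted");
        } catch (Exception e) {
            // FAILED would be retried by Helix (up to a threshold); FATAL_FAILED is not retried
            return new TaskResult(TaskResult.Status.FATAL_FAILED, e.getMessage());
        }
    }

    @Override
    public void cancel() {
        // clean up if the controller cancels this task
    }
}

// Build a two-job DAG (data staging runs after job submission) and run it.
class ExperimentWorkflowLauncher {

    static void launch(HelixManager manager) throws InterruptedException {
        JobConfig.Builder submission = new JobConfig.Builder()
            .setCommand("MockJobSubmission")
            .addTaskConfigs(Collections.singletonList(
                new TaskConfig.Builder().setTaskId("submit-0").setCommand("MockJobSubmission").build()));

        JobConfig.Builder staging = new JobConfig.Builder()
            .setCommand("MockDataStaging")
            .addTaskConfigs(Collections.singletonList(
                new TaskConfig.Builder().setTaskId("stage-0").setCommand("MockDataStaging").build()));

        Workflow.Builder dag = new Workflow.Builder("experiment-workflow")
            .addJob("job-submission", submission)
            .addJob("data-staging", staging)
            .addParentChildDependency("job-submission", "data-staging");

        TaskDriver driver = new TaskDriver(manager);
        driver.start(dag.build());

        // block until the workflow reaches the COMPLETED state
        driver.pollForWorkflowState("experiment-workflow", TaskState.COMPLETED);
    }
}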

Some core concepts used in Helix which are important to know:

·         Participant – a node in a Helix cluster (a.k.a. an instance or 
worker) which hosts resources (a.k.a. tasks); a sketch of starting a 
participant follows this list.

·         Controller – a node in a Helix cluster that monitors and controls 
the Participant nodes. The controller is responsible for checking whether the 
state of a participant node matches the IDEAL state and, if not, for performing 
STATE TRANSITIONS to bring that node back to the IDEAL state.

·         State Model & State Transitions – Helix allows developers to define 
what state a participant node needs to be in for it to be declared healthy. For 
example, in an ONLINE-OFFLINE state model a node is healthy if it is in the 
ONLINE state; if it goes OFFLINE (for any reason), we can define TRANSITION 
actions to bring it back ONLINE.

·         Cluster – Contains participants and controller nodes. One can define 
the State model for a cluster.
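
As a rough illustration of how a participant node might be started (a sketch 
only, assuming the HelixManagerFactory and TaskStateModelFactory APIs; the 
ZooKeeper address, cluster, and instance names are placeholders, and 
MockJobSubmissionTask is the hypothetical task sketched earlier):

import java.util.HashMap;
import java.util.Map;

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.task.Task;
import org.apache.helix.task.TaskCallbackContext;
import org.apache.helix.task.TaskFactory;
import org.apache.helix.task.TaskStateModelFactory;

// Start a participant (worker) node and register the factories that create our tasks.
public class MockWorker {

    public static void main(String[] args) throws Exception {
        String zkAddress = "localhost:2181";      // placeholder ZooKeeper address
        String clusterName = "airavata-cluster";  // placeholder cluster name
        String instanceName = "worker-1";         // placeholder instance name

        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            clusterName, instanceName, InstanceType.PARTICIPANT, zkAddress);

        // Map task "commands" to factories; the controller assigns task partitions to this node.
        Map<String, TaskFactory> taskFactories = new HashMap<>();
        taskFactories.put("MockJobSubmission", new TaskFactory() {
            @Override
            public Task createNewTask(TaskCallbackContext context) {
                return new MockJobSubmissionTask(context);
            }
        });

        // The Task state model drives each assigned task through its state transitions.
        manager.getStateMachineEngine().registerStateModelFactory(
            "Task", new TaskStateModelFactory(manager, taskFactories));

        manager.connect();
    }
}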

How can Helix be used in Airavata?
Assuming we use Helix just to perform distributed task execution, I have the 
following in mind:

·         Create Helix Tasks (by implementing the Task interface) for each of 
our operations: job submission, data staging, etc. These tasks are called 
resources.

·         Create Participant nodes (a.k.a. workers) to hold these resources. 
Helix allows us to create resource partitions, such that if we need a Task to 
run in parallel across workers, we can set the num_partitions > 1 for that 
resource.

·         Define a StateModel, either an OnlineOffline or MasterSlave, and 
necessary state transitions. With state transitions we can control the behavior 
of the participant nodes.

·         Create a WORKFLOW to execute a single experiment. This workflow will 
contain the DAG necessary to run that experiment.

·         Create a long-running QUEUE to keep accepting incoming experiment 
requests. Each new experiment request will result in the creation of a new JOB 
added to this queue; this job will contain one task, which creates and runs the 
workflow mentioned in the bullet above (see the sketch after this list).
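
A rough sketch of the queue idea above (assuming the TaskDriver, JobQueue, and 
JobConfig APIs; the queue name, command, and config key are made up, and the 
task registered under "LaunchExperimentWorkflow" would be the one that builds 
and starts the per-experiment workflow):

import java.util.Collections;

import org.apache.helix.task.JobConfig;
import org.apache.helix.task.JobQueue;
import org.apache.helix.task.TaskConfig;
import org.apache.helix.task.TaskDriver;

// A long-running queue that gets one "launch experiment" job per incoming request.
public class ExperimentQueue {

    private static final String QUEUE_NAME = "experiment-queue";  // placeholder name

    private final TaskDriver driver;

    public ExperimentQueue(TaskDriver driver) throws Exception {
        this.driver = driver;
        // create the queue once; it keeps accepting jobs until explicitly terminated
        driver.createQueue(new JobQueue.Builder(QUEUE_NAME).build());
    }

    // Called for every new experiment request; the single task in this job
    // creates and runs the workflow (DAG) for that experiment.
    public void submitExperiment(String experimentId) throws Exception {
        JobConfig.Builder launchJob = new JobConfig.Builder()
            .setCommand("LaunchExperimentWorkflow")
            .addTaskConfigs(Collections.singletonList(
                new TaskConfig.Builder()
                    .setTaskId("launch-" + experimentId)
                    .setCommand("LaunchExperimentWorkflow")
                    .addConfig("experimentId", experimentId)
                    .build()));
        driver.enqueueJob(QUEUE_NAME, "launch-" + experimentId, launchJob);
    }
}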

I have managed to get a working task execution framework prototype with Helix 
(Java). I am improving it to accommodate mock Airavata services as tasks and 
mock experiment DAGs as workflows. Before we finalize whether or not to use 
Helix, I would like to demonstrate this prototype and then take it forward from 
there.

I would love to hear more thoughts, suggestions or comments about this 
proposal. If anyone is familiar with Helix, I would love to hear your inputs.

Thanks and Regards,
Gourav Shenoy
