Re: [GSoC] Integrating DataCat System with Apache Airavata & production GridChem

Suresh Marru Fri, 27 Mar 2015 08:35:44 -0700

Hi Supun,

This is a well-thought-out proposal. Yes, you raise a good question about the 
correlation of Data IDs and Experiment IDs. What you propose (through 
notifications) is one way to handle it. The next time we improve the data 
models, we should consider this requirement.


For now, can you go back to your Melange proposal and add these 
to-be-determined implementation details? If you edit the proposal, make sure 
you leave a comment on the change you made.

Suresh

On Mar 26, 2015, at 5:02 AM, Supun Nakandala <[email protected]> wrote:
> 
> Hi All,
> 
> I have submitted a proposal for the Google Summer of Code program to integrate 
> the DataCat System with Apache Airavata and production GridChem. My proposal 
> can be found at [1], and I have also attached it to the Airavata wiki [2].
> 
> The high-level architecture for this integration is shown in the 
> following diagram.
> 
> <fyp.png>
> 
> The flow of execution will be as follows.
> 
> 1. A scientist uses a web-based reference gateway to submit a job to a 
>    computational resource using Airavata.
> 2. Airavata executes the application on remote resources.
> 3. After the application execution completes successfully, Airavata calls 
>    the DataCat handler (a new component being added).
> 4. The DataCat handler copies the generated data products from the remote 
>    locations to a data archive for long-term preservation. This is 
>    important because in the current version of Airavata data products are 
>    generated in the /tmp folder and are not persistent.
> 5. After copying the data, the DataCat handler publishes a message to a 
>    RabbitMQ message broker about the generation of the data product, along 
>    with related provenance information such as application name, experiment 
>    name, inputs, etc.
> 6. The DataCat agent is subscribed to the message broker and receives this 
>    message. The agent then accesses the data product and indexes it in the 
>    DataCat server.
> 7. The web-based reference gateway incorporates search features that use 
>    the DataCat service methods behind the scenes.
> 
> In the proposed solution, the coupling between the two systems is minimized 
> because the communication is done via a message queue. If required, Airavata 
> can be run independently without running the DataCat system.
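
To make the publish step in the flow above concrete, here is a minimal sketch of the kind of message the DataCat handler might send. This is only an illustration: the field names (experimentId, archivePath, and so on) and the function names are hypothetical, and an in-memory queue stands in for the RabbitMQ broker so the snippet runs without a broker installed.

```python
import json
import queue

# Stand-in for the RabbitMQ broker (assumption: a real deployment would
# publish through a broker client instead of an in-process queue).
broker = queue.Queue()


def build_provenance_message(experiment_id, application_name,
                             experiment_name, inputs, archive_path):
    """Build the notification sent after a data product is archived.

    All field names are hypothetical; the proposal only says the message
    carries the data product location plus provenance information.
    """
    return {
        "experimentId": experiment_id,
        "applicationName": application_name,
        "experimentName": experiment_name,
        "inputs": list(inputs),
        "archivePath": archive_path,
    }


def publish(message):
    """Serialize the message as JSON and hand it to the broker stand-in."""
    broker.put(json.dumps(message))


def agent_receive():
    """What the DataCat agent would do on delivery: decode the message."""
    return json.loads(broker.get_nowait())
```

Because the handler only writes to the queue, Airavata does not need the DataCat agent to be running when it publishes, which is the loose coupling the proposal describes.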
> 
> However, I have the following concern about the above architecture. From 
> Airavata's point of view, the experiment ID uniquely identifies a single 
> experiment execution, and all other data in the registry relating to an 
> experiment are indexed under the experiment ID. In the DataCat system, after 
> indexing the metadata for a particular data product, it generates a 
> document ID for the metadata document. Somehow we need to map this document 
> ID to the experiment ID in the Airavata registry.
> 
> One way to do this is to run a message queue listener on the Airavata side 
> which gets notified of (exp_id, metadata_doc_id) pairs and updates the 
> registry to include the corresponding metadata doc ID. At the DataCat end, 
> after successfully indexing a metadata doc, it will publish the (exp_id, 
> metadata_doc_id) pair to a message queue.
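
The listener idea above could look roughly like the sketch below. It is not Airavata or DataCat code: the registry is a plain dictionary, the queue is an in-memory stand-in for the message queue, and the function names are hypothetical. Only the (exp_id, metadata_doc_id) pair comes from the proposal.

```python
import json
import queue

# Stand-ins (assumptions): an in-process queue instead of the real message
# queue, and a dict instead of the Airavata registry.
mapping_queue = queue.Queue()
registry = {}  # exp_id -> list of metadata doc ids


def publish_mapping(exp_id, metadata_doc_id):
    """DataCat side: announce a pair after a metadata doc is indexed."""
    mapping_queue.put(json.dumps(
        {"exp_id": exp_id, "metadata_doc_id": metadata_doc_id}))


def drain_mappings():
    """Airavata side: consume pending pairs and update the registry.

    Returns the number of messages processed. A pair that is already
    recorded is skipped, so a redelivered message does no harm.
    """
    processed = 0
    while True:
        try:
            raw = mapping_queue.get_nowait()
        except queue.Empty:
            break
        pair = json.loads(raw)
        doc_ids = registry.setdefault(pair["exp_id"], [])
        if pair["metadata_doc_id"] not in doc_ids:
            doc_ids.append(pair["metadata_doc_id"])
        processed += 1
    return processed
```

Keeping a list per experiment allows one experiment to produce several data products, and the duplicate check is there because message brokers can deliver the same message more than once.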
> 
> WDYT about this approach?
> 
> -Supun
> 
> [1] - http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/scnakandala/5751725713522688
> [2] - https://cwiki.apache.org/confluence/display/AIRAVATA/Integrating+DataCat+System+to+Apache+Airavata
