Re: Airavata Data Management Challenges

Marlon Pierce Fri, 07 Mar 2014 05:55:26 -0800

Patches are welcome...it will also help us make sure we are
communicating the use cases.


Marlon

On 3/6/14 11:52 PM, Mattmann, Chris A (3980) wrote:
> Great description of the use cases.
>
> Hate to sound like a broken record here, but the Apache OODT file manager
> along with its integration with Apache Lucene/Solr, and Tika and a number
> of other technologies I think totally fits the need here.
>
> I realize that along with that I should put my money where my mouth
> is with the old "patches welcome" moniker :) Maybe I will.. :)
>
> Just my 2c.
>
> Cheers,
> Chris
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-283, Mailstop: 171-246
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: Sachith Withana <[email protected]>
> Reply-To: "[email protected]"
> <[email protected]>
> Date: Thursday, March 6, 2014 8:28 PM
> To: "[email protected]" <[email protected]>
> Subject: Airavata Data Management Challenges
>
>> Hi all,
>>
>> This is a follow-up to the discussion we had on “Current Database
>> remodeling of Apache Airavata” [1], which was followed by a successful
>> Google Hangout On Air [2].
>>
>> Prevailing Registry (Registry CPI)
>>
>> This is the current Data Model design of the Registry. [3]
>>
>> Currently we use a MySQL database abstracted by an OpenJPA layer.
>>
>> The Registry contains
>>
>>   - Experiment-related data (refer to the Data Model design [3]):
>>     experiment, application, and node level statuses and errors;
>>     scheduling and QoS (user-provided) information; and the inputs and
>>     outputs of each experiment, node, and application.
>>   - Application Catalogs: contain the descriptors (host, application,
>>     service).
>>   - Gateway information
>>   - User information (mostly admin users of the gateways)
>>
>>
>> Problems faced
>>
>>
>> Note: We haven’t done any performance testing on the registry or even
>> included the current registry in a release yet.
>>
>>
>>   - Application Catalogs (Descriptors)
>>     The current Application Catalogs are XML files. We store them in the
>>     registry as blobs, so we cannot query them.
>>   - Data Model changes
>>     The data models are highly hierarchical, which causes a lot of
>>     problems when they need to change. They are expected to change
>>     heavily during the development phases (0.12, 0.13…) until we settle
>>     on a concrete solution for the production release (1.0). With the
>>     current implementation, accommodating even a small change to the
>>     model requires several levels of costly code changes, and that cost
>>     recurs because the models keep changing.
>>   - Hierarchy causing overhead
>>     Since the whole current data model is hierarchical, there is a
>>     significant overhead in retrieving data.
>>     e.g.: To get an Experiment, you need to make multiple queries from
>>     the bottom up (from the job level to the experiment level: job →
>>     task → node → workflow → experiment) to get the whole Experiment.
>>
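A minimal sketch of the retrieval overhead described above, in Python with an in-memory SQLite database standing in for MySQL. The five-table schema and every table, column, and function name here are hypothetical; they only mirror the job → task → node → workflow → experiment chain and are not the actual registry schema. The sketch contrasts walking the hierarchy one query per level with a single joined query, as one possible way to reduce round trips.

```python
import sqlite3

# Hypothetical, simplified hierarchy (top to bottom). These names are
# illustrative only and are not taken from the Airavata registry.
LEVELS = ["experiment", "workflow", "node", "task", "job"]


def build_db():
    """Create one table per level, each child pointing at its parent."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT)")
    for parent, child in zip(LEVELS, LEVELS[1:]):
        conn.execute(
            f"CREATE TABLE {child} (id INTEGER PRIMARY KEY, "
            f"{parent}_id INTEGER REFERENCES {parent}(id), name TEXT)"
        )
    conn.execute("INSERT INTO experiment VALUES (1, 'exp-1')")
    conn.execute("INSERT INTO workflow VALUES (1, 1, 'wf-1')")
    conn.execute("INSERT INTO node VALUES (1, 1, 'node-1')")
    conn.execute("INSERT INTO task VALUES (1, 1, 'task-1')")
    conn.execute("INSERT INTO job VALUES (1, 1, 'job-1')")
    return conn


def experiment_for_job_stepwise(conn, job_id):
    """Walk job -> task -> node -> workflow -> experiment, one query per level.

    Returns the experiment name and the number of queries issued.
    """
    queries = 0
    current_id = job_id
    bottom_up = LEVELS[::-1]
    # Pair each level with its parent, starting from the job level.
    for child, parent in zip(bottom_up, bottom_up[1:]):
        row = conn.execute(
            f"SELECT {parent}_id FROM {child} WHERE id = ?", (current_id,)
        ).fetchone()
        queries += 1
        current_id = row[0]
    name = conn.execute(
        "SELECT name FROM experiment WHERE id = ?", (current_id,)
    ).fetchone()[0]
    queries += 1
    return name, queries


def experiment_for_job_joined(conn, job_id):
    """Resolve the same experiment with a single joined query."""
    row = conn.execute(
        """
        SELECT e.name
        FROM job j
        JOIN task t ON j.task_id = t.id
        JOIN node n ON t.node_id = n.id
        JOIN workflow w ON n.workflow_id = w.id
        JOIN experiment e ON w.experiment_id = e.id
        WHERE j.id = ?
        """,
        (job_id,),
    ).fetchone()
    return row[0]
```

With the sample row in each table, the stepwise walk issues five queries to reach the experiment name, while the joined version issues one. Whether a join like this fits behind the OpenJPA layer is a separate question that this sketch does not address.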
>>
>> Use cases
>>
>> Here are some typical queries Airavata should support (with respect to
>> the gateways that are being integrated with Airavata). Some gateways use
>> workflows, while others use single job submission.
>>
>>
>>   - *ParamChem* (Workflow oriented)
>>      - get the data of each node (of the Workflow)
>>         - inputs
>>         - outputs
>>         - status
>>      - get updated node data since last retrieval (wish)
>>
>>   - *CIPRES* (Single Job Submission)
>>      - get Experiment Summary
>>         - metadata
>>         - statistics
>>            - inputs
>>            - parameters
>>            - intermediate data
>>         - progress
>>
>>   - Clone an existing experiment (with either different descriptors or
>>     inputs)
>>   - Store output files (wish)
>>
>>   - *UltraScan* (Single Job)
>>      - get Job level status (GFac level status) (it’s the second lowest
>>        level of statuses; refer to the Data Model Design [3])
>>      - get Application Level Statuses (the UltraScan application issues
>>        statuses; we need to get them to the user)
>>      - get Output data
>>
>>   - *CyberGateway* (Single Job Submission)
>>      - get Summary of all Experiments
>>         - metadata
>>         - status
>>         - progress
>>
>>
>> Requirements/Suggestions
>>
>>   - Here are the Data Persistence Requirements [4]
>>   - Application Catalog
>>     A proper way and place to store the application catalogs so that
>>     they can be queried.
>>   - Meta-Data Catalog
>>     Our Data Model is highly hierarchical. Since the Data Models will
>>     keep changing during the development phase (until a production
>>     release), we need to come up with a way to accommodate the
>>     hierarchical changes.
>>   - Separate out the Registry, Data Store, Provenance, etc.
>>
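A minimal sketch of one way to make the application catalog queryable, assuming a few descriptor fields are extracted into ordinary columns when a descriptor is stored, while the full XML is kept alongside them. The descriptor layout (`<application>`, `<name>`, `<host>`), the `app_catalog` table, and the function names are hypothetical and do not reflect the real Airavata descriptors. SQLite stands in for MySQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical descriptor layout; real Airavata descriptors differ.
DESCRIPTORS = [
    "<application><name>echo</name><host>localhost</host></application>",
    "<application><name>blast</name><host>cluster-a</host></application>",
    "<application><name>amber</name><host>cluster-a</host></application>",
]


def load_catalog(xml_docs):
    """Store each descriptor as a blob plus a few extracted, queryable columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE app_catalog (name TEXT, host TEXT, descriptor BLOB)"
    )
    for doc in xml_docs:
        root = ET.fromstring(doc)
        conn.execute(
            "INSERT INTO app_catalog VALUES (?, ?, ?)",
            (root.findtext("name"), root.findtext("host"), doc.encode("utf-8")),
        )
    return conn


def applications_on_host(conn, host):
    """Query by an extracted column instead of parsing every blob."""
    rows = conn.execute(
        "SELECT name FROM app_catalog WHERE host = ? ORDER BY name", (host,)
    )
    return [row[0] for row in rows]
```

Calling `applications_on_host(load_catalog(DESCRIPTORS), "cluster-a")` returns `["amber", "blast"]`. The trade-off is that every field that needs to be searchable has to be chosen up front and kept in sync with the XML.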
>>
>> Wish List
>>
>>   - File Management
>>     Extract metadata from large files and store it.
>>
>>
>> Special thanks to Saminda for creating the Data Persistence Requirements
>> document and to the whole Airavata team for helping out with this analysis.
>>
>> [1] http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
>> [2] http://www.youtube.com/watch?v=EY6oPwqi1g4
>> [3] https://github.com/apache/airavata/tree/master/airavata-api
>> [4] https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
>>
>> -- 
>> Thanks,
>> Sachith Withana
