Great description of the use cases.

I hate to sound like a broken record here, but I think the Apache OODT file
manager, along with its integration with Apache Lucene/Solr, Tika, and a
number of other technologies, totally fits the need here.

I realize that along with that I should put my money where my mouth
is with the old "patches welcome" moniker :) Maybe I will.. :)

Just my 2c.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-283, Mailstop: 171-246
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Sachith Withana <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, March 6, 2014 8:28 PM
To: "[email protected]" <[email protected]>
Subject: Airavata Data Management Challenges

>Hi all,
>
>This is a follow-up to the discussion we had on “Current Database
>remodeling of Apache Airavata” [1], which was followed by a successful
>Google Hangout on Air [2].
>
>Prevailing Registry (Registry CPI)
>
>This is the current Data Model design of the Registry. [3]
>
>Currently we use a MySQL database abstracted by an OpenJPA layer.
>
>The Registry contains
>
>   - Experiment-related data (refer to the Data Model design [3]):
>     experiment, application, and node level statuses; errors; scheduling
>     and QoS (user provided) information; and the inputs and outputs of
>     each experiment, node, and application.
>   - Application Catalogs, which contain the descriptors (host,
>     application, service)
>   - Gateway information
>   - User information (mostly admin users of the gateways)
>
>
>Problems faced
>
>
>Note: We haven’t done any performance testing on the registry or even
>included the current registry in a release yet.
>
>
>   - Application Catalogs (Descriptors)
>     The current Application Catalogs are XML files. We store them in the
>     registry as blobs, so we cannot query them.
>   - Data Model Changes
>     The data models are highly hierarchical, which causes a lot of
>     problems when they need to be changed.
>     The data models are expected to change heavily during the development
>     phases (0.12, 0.13, …) until we settle on a concrete solution for the
>     production release (1.0).
>     With the current implementation, accommodating even a small change to
>     the model requires several levels of costly code changes, and that
>     cost recurs because the data models keep changing.
>   - Hierarchy causing overhead
>     Since the whole current data model is hierarchical, there is a
>     significant overhead in retrieving data.
>     For example, to get a whole Experiment you need to make multiple
>     queries from the bottom up, from the job level to the experiment
>     level (job → task → node → workflow → experiment).
>
>
>Use cases
>
>Here are some typical queries Airavata should support (with respect to
>the gateways that are being integrated with Airavata).
>Some gateways use workflows while others use single job submission.
>
>
>   - *ParamChem* (Workflow oriented)
>      - get the data of each node (of the Workflow)
>         - inputs
>         - outputs
>         - status
>      - get updated node data since last retrieval (wish)
>
>   - *CIPRES* (Single Job Submission)
>      - get Experiment Summary
>         - metadata
>         - statistics
>            - inputs
>            - parameters
>            - intermediate data
>         - progress
>
>   - Clone an existing experiment (with either different descriptors or
>     inputs)
>   - Store output files (wish)
>
>   - *UltraScan* (Single Job)
>      - get Job level status (GFac level status) (it’s the second lowest
>        level of statuses, refer to the Data Model Design [3])
>      - get Application Level Statuses (the UltraScan application issues
>        statuses, we need to get them to the user)
>      - get Output data
>
>   - *CyberGateway* (Single Job Submission)
>      - get Summary of all Experiments
>         - metadata
>         - status
>         - progress
>
>Requirements/Suggestions
>
>   - Here are the Data Persistence Requirements [4]
>   - Application Catalog
>     A proper way and place to store the application catalogs so that they
>     can be queried
>
>   - Meta-Data Catalog
>     Our Data Model is highly hierarchical.
>     Since the Data Models will keep changing during the development phase
>     (until a production release), we need a way to accommodate changes to
>     the hierarchy
>   - Separate out the registry, Data Store, Provenance, etc.
>
>
>Wish List
>
>   - File Management
>     Extract metadata from large files and store it
>
>
>Special thanks to Saminda for creating the Data Persistence Requirements
>document and to the whole Airavata team for helping out on this analysis.
>
>[1]
>http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
>
>[2]
>http://www.youtube.com/watch?v=EY6oPwqi1g4
>
>[3]
>https://github.com/apache/airavata/tree/master/airavata-api
>
>[4]
>https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
>
>-- 
>Thanks,
>Sachith Withana
