Great description of the use cases. Hate to sound like a broken record here, but I think the Apache OODT file manager, along with its integration with Apache Lucene/Solr, Tika, and a number of other technologies, totally fits the need here.
I realize that along with that I should put my money where my mouth is with the old "patches welcome" moniker :) Maybe I will.. :)

Just my 2c.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-283, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Sachith Withana <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, March 6, 2014 8:28 PM
To: "[email protected]" <[email protected]>
Subject: Airavata Data Management Challenges

>Hi all,
>
>This is a follow up discussion of what we had on the “Current Database
>remodeling of Apache Airavata” [1], followed by a successful Google on air
>hangout. [2]
>
>Prevailing Registry (Registry CPI)
>
>This is the current Data Model design of the Registry. [3]
>
>Currently we use a MySQL database abstracted by an OpenJPA layer.
>
>The Registry contains
>
>   - Experiments related data (refer to the Data Model design: [3])
>     includes experiment, application, node level statuses, errors, Scheduling
>     and QOS (user provided) information, Inputs and outputs of each
>     experiment, node and the application.
>   - Application Catalogs
>     contains the descriptors (host, application, service)
>   - Gateway Information
>   - User information (mostly admin users of the gateways)
>
>
>Problems faced
>
>
>Note: We haven’t done any performance testing on the registry or even
>included the current registry in a release yet.
>
>   - Application Catalogs (Descriptors)
>     In the current version, the Application Catalogs are XML files.
>     We are storing them in the registry as blobs → we cannot query them.
>   - Data Model Changes
>     The data models are highly hierarchical. This causes a lot of problems
>     when the Data Models need to be changed.
>     Data Models are expected to change heavily within the development
>     phases (0.12, 0.13…) until we settle on a concrete solution for the
>     production release (1.0).
>     To accommodate even small changes to the model, we need to go
>     through several levels of costly code level changes due to the current
>     implementation.
>     It can be costly since the Data Models keep changing all the time.
>   - Hierarchy causing overhead
>     Since the whole current data model is hierarchical, there is a
>     significant overhead in retrieving data.
>     ex: To get an Experiment, you need to make multiple queries from the
>     bottom up (from the job level to the experiment level:
>     job → task → node → workflow → experiment) to get the whole Experiment.
>
>
>Use cases
>
>Here are some typical queries Airavata should support (with respect to
>the gateways that are being integrated with Airavata).
>Some gateways use workflows while the others use single job submission.
>
>   - *ParamChem* (Workflow oriented)
>      - get the data of each node (of the Workflow)
>         - inputs
>         - outputs
>         - status
>      - get updated node data since last retrieval (wish)
>
>   - *CIPRES* (Single Job Submission)
>      - get Experiment Summary
>         - metadata
>         - statistics
>         - inputs
>         - parameters
>         - intermediate data
>         - progress
>      - Clone an existing experiment (with either different descriptors or
>        inputs)
>      - Store output files (wish)
>
>   - *UltraScan* (Single Job)
>      - get Job level status (Gfac level status) (it’s the second lowest
>        level of statuses, refer to the Data Model Design [3])
>      - get Application Level Statuses (the UltraScan application issues
>        statuses, we need to get them to the user)
>      - get Output data
>
>   - *CyberGateway* (Single Job Submission)
>      - get Summary of all Experiments
>         - metadata
>         - status
>         - progress
>
>
>Requirements/Suggestions
>
>   - Here are the Data Persistence Requirements [4]
>   - Application Catalog
>     A proper way and place to store the application catalogs so that they
>     can be queried
>   - Meta-Data Catalog
>     Our Data Model is highly hierarchical.
>     Since the Data Models will keep changing in the development phase
>     (until a production release), we need to come up with a way to make
>     hierarchical changes easier
>   - Separate out the registry, Data Store, Provenance ...etc
>
>
>Wish List
>
>   - File Management
>     Extract metadata from large files and store it
>
>
>Special Thanks to Saminda for creating the Data Persistence requirements
>document and the whole Airavata team for helping out on this analysis.
>
>[1] http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
>[2] http://www.youtube.com/watch?v=EY6oPwqi1g4
>[3] https://github.com/apache/airavata/tree/master/airavata-api
>[4] https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
>
>--
>Thanks,
>Sachith Withana
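
To make the "blobs → we cannot query them" problem from the quoted message concrete, here is a rough sketch of the general idea behind extracting metadata and indexing it next to the raw file. It is only an illustration under stated assumptions: the descriptor XML, the table names, and the helper functions are all hypothetical, it uses Python's standard library (sqlite3 and ElementTree) rather than Airavata, OODT, Tika, or Solr, and it is not how any of those projects actually store descriptors.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical application descriptor. The element names are illustrative
# only and are NOT Airavata's real descriptor schema.
DESCRIPTOR_XML = """
<applicationDescriptor>
  <name>echo</name>
  <host>localhost</host>
  <executable>/bin/echo</executable>
</applicationDescriptor>
"""


def make_store():
    """Create an in-memory store: one table for raw XML, one for indexed fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE descriptor (id INTEGER PRIMARY KEY, raw_xml TEXT)")
    conn.execute(
        "CREATE TABLE descriptor_field (descriptor_id INTEGER, name TEXT, value TEXT)"
    )
    return conn


def extract_metadata(xml_text):
    """Flatten the top-level child elements of a descriptor into a dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def store_descriptor(conn, xml_text):
    """Keep the raw XML as-is, and also index each extracted field as a row."""
    cur = conn.execute("INSERT INTO descriptor (raw_xml) VALUES (?)", (xml_text,))
    descriptor_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO descriptor_field (descriptor_id, name, value) VALUES (?, ?, ?)",
        [(descriptor_id, k, v) for k, v in extract_metadata(xml_text).items()],
    )
    return descriptor_id


def find_by_field(conn, name, value):
    """Return ids of descriptors whose extracted field matches exactly."""
    rows = conn.execute(
        "SELECT descriptor_id FROM descriptor_field "
        "WHERE name = ? AND value = ? ORDER BY descriptor_id",
        (name, value),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_store()
    store_descriptor(conn, DESCRIPTOR_XML)
    # The raw XML stays opaque, but the extracted fields can be queried.
    print(find_by_field(conn, "host", "localhost"))
```

The point of the sketch is only the split: the original file is kept untouched, while a separate, flat set of extracted fields is what gets queried, so a change in the XML layout touches the extraction step rather than every query.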
