Re: Airavata Data Management Challenges

Sachith Withana Mon, 17 Mar 2014 00:32:25 -0700

Thanks for the very informative response.

Please see my inline comments below.



On Thu, Mar 13, 2014 at 1:27 AM, Eran Chinthaka Withana <
[email protected]> wrote:

> Hi,
>
> First, sorry for the late reply, and thanks for the detailed message. Here is
> what I think.
>
> It seems we keep getting burnt simply because the data model is
> hierarchical. All the queries mentioned in the use cases require us to
> navigate a significant portion of the model to get to the correct data.
> Assuming performance and scalability are major concerns, here is what I
> think our options are.
>
>
> 1. Assuming we want to keep our hierarchical model as it is
>
>       1.1 Optimize the traversal of the object model: we can easily get help
> from memory-resident object graphs (like Titan, Hazelcast, or Pregel) backed
> by a read/write-through persistence layer. This will allow all sorts of
> unusual in-memory graph traversals.
>       1.2 Optimize data retrieval: we can create indexes for each query
> while maintaining the hierarchical model in the background. For example, an
> index is created mapping a job to an experiment, so that retrieving the
> experiment for a given job is easier without hopping through all the nodes.
>
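
To check my understanding of option 1.2, here is a minimal Python sketch: the hierarchy stays as it is, and a job-to-experiment index is maintained next to it. The class and method names (Node, HierarchicalRegistry, experiment_for_job) are hypothetical and the storage is an in-memory dict, not the actual Registry implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One level of the hierarchy: experiment, workflow, node, task, or job."""
    node_id: str
    kind: str
    parent: Optional["Node"] = None


class HierarchicalRegistry:
    """Hypothetical in-memory registry that keeps the hierarchy plus one index."""

    def __init__(self):
        self.nodes = {}               # node_id -> Node (the hierarchical model)
        self.job_to_experiment = {}   # job_id -> experiment_id (the index)

    def add(self, node_id, kind, parent_id=None):
        parent = self.nodes[parent_id] if parent_id is not None else None
        node = Node(node_id, kind, parent)
        self.nodes[node_id] = node
        if kind == "job":
            # Pay the traversal cost once, at write time.
            experiment = self._walk_to_experiment(node)
            self.job_to_experiment[node_id] = experiment.node_id
        return node

    def _walk_to_experiment(self, node):
        # Hop through every parent until the experiment level is reached.
        while node is not None and node.kind != "experiment":
            node = node.parent
        if node is None:
            raise ValueError("node is not attached to an experiment")
        return node

    def experiment_for_job_by_traversal(self, job_id):
        """Lookup without the index: walks job -> task -> node -> workflow."""
        return self._walk_to_experiment(self.nodes[job_id]).node_id

    def experiment_for_job(self, job_id):
        """Lookup with the index: a single dictionary access."""
        return self.job_to_experiment[job_id]


registry = HierarchicalRegistry()
registry.add("exp-1", "experiment")
registry.add("wf-1", "workflow", "exp-1")
registry.add("node-1", "node", "wf-1")
registry.add("task-1", "task", "node-1")
registry.add("job-1", "job", "task-1")
print(registry.experiment_for_job("job-1"))
```

Both lookups return the same experiment id; the indexed one just avoids the hops at read time. In a database-backed registry the index would have to be kept in sync on every write, which this sketch only does for inserts.
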
> 2. Assuming we are flexible enough to break away from the hierarchical object
> model
>
> Denormalize the object model and come up with a new object model that might
> duplicate data with indices. For example, the job node might contain a
> reference to all of its parents (task, node, workflow, experiment) within
> itself. This would also include improving the data update model to separate
> frequently accessed objects from infrequently accessed ones.
>
> I'd encourage you to draw two object models.
>
> 1. a denormalized object model evolved from the current hierarchical object
> model
>

What would be the purpose of denormalizing the object model? To optimize
the queries by having a flatter structure that supports them directly?
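
If I read the suggestion correctly, a denormalized job record would carry the ids of all its parents, roughly as in the Python sketch below. The JobRecord fields and the jobs_in_experiment helper are hypothetical names for illustration, not part of the current data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical flat job record that duplicates its parents' ids."""
    job_id: str
    task_id: str
    node_id: str
    workflow_id: str
    experiment_id: str
    status: str


def jobs_in_experiment(jobs, experiment_id):
    """Answer the query from the job records alone, with no parent lookups."""
    return [job for job in jobs if job.experiment_id == experiment_id]


jobs = [
    JobRecord("job-1", "task-1", "node-1", "wf-1", "exp-1", "RUNNING"),
    JobRecord("job-2", "task-2", "node-2", "wf-1", "exp-1", "DONE"),
    JobRecord("job-3", "task-3", "node-3", "wf-2", "exp-2", "DONE"),
]
print([job.job_id for job in jobs_in_experiment(jobs, "exp-1")])
```

The cost is the duplicated parent ids, which have to be written along with every job.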


> 2. an object model that can directly support the queries that you have
>

The current data model supports all the use cases that we have. The
problems are its low efficiency and that it is not convenient to alter
the data model (in terms of implementation).


>
> Once we have these two, we can merge them to get the ideal object model
> which can support most of the existing scenarios.
>
> Finally, in the places where the data model changes frequently, try to define
> those parts more loosely using name-value pairs rather than a strict schema.
> This will not only enable you to expand the model but also to version it
> based on the evolution of the model.
>
> Finally, think about putting an API in front of the registry. It can initially
> be as simple as "getObjectById(String id):XMLObject/JsonObject". But later,
> a second level API can be introduced to do queries like
> "getObjectThroughSecondaryIndex(String secondaryIndexName, String
> secondaryIndexValue, String objectName):List<XMLObject/JsonObject>". This
> can be used to do things like retrieving all jobs in a workflow by issuing
> "getObjectThroughSecondaryIndex("WorkflowId", <workflow-id>, "JobDetails")"
>
>
We do have an interface in front of the registry/database (called the
Registry CPI), and it uses key-value pairs to store data.
What did you mean by "version based evolution of the model"?
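
For comparison with the Registry CPI, here is a rough Python sketch of the two-level API described above. The method names mirror the ones in the quoted text; the RegistryFacade class, the put method, and the in-memory dictionaries are assumptions made only for illustration.

```python
from collections import defaultdict


class RegistryFacade:
    """Hypothetical facade: lookup by id plus lookup through a secondary index."""

    def __init__(self):
        self._objects = {}
        # (index name, index value, object name) -> list of object ids
        self._secondary = defaultdict(list)

    def put(self, object_id, object_name, attributes, indexed=()):
        """Store name-value pairs and register the requested secondary indexes.

        Re-putting an existing id is not handled in this sketch.
        """
        self._objects[object_id] = {"id": object_id, "type": object_name, **attributes}
        for index_name in indexed:
            key = (index_name, attributes[index_name], object_name)
            self._secondary[key].append(object_id)

    def get_object_by_id(self, object_id):
        """First-level API: getObjectById(id)."""
        return self._objects[object_id]

    def get_object_through_secondary_index(self, index_name, index_value, object_name):
        """Second-level API: getObjectThroughSecondaryIndex(name, value, objectName)."""
        ids = self._secondary.get((index_name, index_value, object_name), [])
        return [self._objects[object_id] for object_id in ids]


registry = RegistryFacade()
registry.put("job-1", "JobDetails", {"WorkflowId": "wf-1", "status": "RUNNING"},
             indexed=["WorkflowId"])
registry.put("job-2", "JobDetails", {"WorkflowId": "wf-1", "status": "DONE"},
             indexed=["WorkflowId"])
registry.put("job-3", "JobDetails", {"WorkflowId": "wf-2", "status": "DONE"},
             indexed=["WorkflowId"])

# Retrieve all jobs in a workflow, as in the example call quoted above.
jobs = registry.get_object_through_secondary_index("WorkflowId", "wf-1", "JobDetails")
print([job["id"] for job in jobs])
```

Because each object is a plain set of name-value pairs, adding a new attribute does not require changing the facade itself.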


> Let's first pick one of the above options (hierarchical vs. denormalized) and
> then dig deep into the use cases mentioned in the mail. Without that, it's
> hard to discuss every option.
>
>
> Thanks,
> Eran Chinthaka Withana
>
>
> On Thu, Mar 6, 2014 at 8:28 PM, Sachith Withana <[email protected]>
> wrote:
>
> > Hi all,
> >
> > This is a follow-up to the discussion we had on the “Current Database
> > remodeling of Apache Airavata” [1], followed by a successful Google
> > on-air hangout. [2]
> >
> > Prevailing Registry ( Registry CPI)
> >
> > This is the current Data Model design of the Registry. [3]
> >
> > Currently we use a MySQL database abstracted by an OpenJPA layer.
> >
> > The Registry contains
> >
> >    - Experiment-related data (refer to the Data Model design [3])
> >      Includes experiment, application, and node level statuses, errors,
> >      scheduling and QoS (user-provided) information, and the inputs and
> >      outputs of each experiment, node, and application.
> >    - Application Catalogs
> >      Contain the descriptors (host, application, service)
> >    - Gateway Information
> >    - User information (mostly admin users of the gateways)
> >
> >
> > Problems faced
> >
> >
> > Note: We haven’t done any performance testing on the registry or even
> > included the current registry in a release yet.
> >
> >
> >    - Application Catalogs (Descriptors)
> >      The current Application Catalogs are XML files. We store them in
> >      the registry as blobs, so we cannot query them.
> >    - Data Model Changes
> >      The data models are highly hierarchical, which causes a lot of
> >      problems when they need to be changed.
> >      The data models are expected to change heavily during the
> >      development phases (0.12, 0.13, ...) until we settle on a concrete
> >      solution for the production release (1.0).
> >      To accommodate even small changes to the model, we need to go
> >      through several levels of costly code-level changes due to the
> >      current implementation. This is costly because the data models
> >      keep changing.
> >    - Hierarchy causing overhead
> >      Since the whole current data model is hierarchical, there is a
> >      significant overhead in retrieving data.
> >      e.g., to get an Experiment, you need to make multiple queries from
> >      the bottom up (from the job level to the experiment level: job →
> >      task → node → workflow → experiment) to get the whole Experiment.
> >
> >
> > Use cases
> >
> > Here are some typical queries Airavata should support (with respect to
> > the gateways that are being integrated with Airavata).
> > Some gateways use workflows while others use single job submission.
> >
> >
> >    - *ParamChem* (Workflow oriented)
> >       - get the data of each node (of the Workflow)
> >          - inputs
> >          - outputs
> >          - status
> >       - get updated node data since last retrieval (wish)
> >    - *CIPRES* (Single Job Submission)
> >       - get Experiment Summary
> >          - metadata
> >          - statistics
> >             - inputs
> >             - parameters
> >             - intermediate data
> >          - progress
> >    - Clone an existing experiment (with either different descriptors or
> >      inputs)
> >    - Store output files (wish)
> >    - *UltraScan* (Single Job)
> >       - get Job level status (GFac level status) (it’s the second lowest
> >         level of statuses, refer to the Data Model Design [3])
> >       - get Application Level Statuses (the UltraScan application issues
> >         statuses, we need to get them to the user)
> >       - get Output data
> >    - *CyberGateway* (Single Job Submission)
> >       - get Summary of all Experiments
> >          - metadata
> >          - status
> >          - progress
> >
> >
> > Requirements/Suggestions
> >
> >    - Here are the Data Persistence Requirements [4]
> >    - Application Catalog
> >      A proper way and place to store the application catalogs so that
> >      they can be queried
> >    - Meta-Data Catalog
> >      Our data model is highly hierarchical.
> >      Since the data models will keep changing in the development phase
> >      (until a production release), we need to come up with a way to
> >      accommodate the hierarchical changes
> >    - Separate out the registry, Data Store, Provenance, etc.
> >
> >
> > Wish List
> >
> >    - File Management
> >      Extract metadata from large files and store it
> >
> >
> > Special thanks to Saminda for creating the Data Persistence Requirements
> > document and to the whole Airavata team for helping out with this analysis.
> >
> > [1]
> > http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
> > [2] http://www.youtube.com/watch?v=EY6oPwqi1g4
> > [3] https://github.com/apache/airavata/tree/master/airavata-api
> > [4]
> > https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
> >
> > --
> > Thanks,
> > Sachith Withana
> >
>



-- 
Thanks,
Sachith Withana
