Hi,

First, sorry for the late reply, and thanks for the detailed message. Here
is what I think.

Seems like we are getting burnt frequently simply because the data model is
hierarchical. All the queries mentioned in the use cases require us to
navigate a significant portion of the model to get to the correct data.
Assuming performance and scalability are the major concerns, here is what I
think our options are.


1. Assuming we want to keep our hierarchical model as it is

      1.1 Optimize the traversal of the object model: we can easily get help
from memory-resident object graphs (e.g. Titan, Hazelcast, or Pregel) backed
by a read/write-through persistence layer. This would let us do all sorts of
in-memory graph traversals.
      1.2 Optimize data retrieval: we can create indexes for each query
while maintaining the hierarchical model in the background. For example, an
index mapping each job to its experiment makes retrieving the experiment for
a given job easy without hopping through all the intermediate nodes; see the
sketch below.
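
To make 1.2 concrete, here is a minimal sketch of such a secondary index.
All class and method names below are made up for illustration; they are not
actual registry classes:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical secondary index kept alongside the hierarchical model:
// it maps a job id straight to its experiment id, so a lookup does not
// have to walk job -> task -> node -> workflow -> experiment.
public class JobToExperimentIndex {

    private final Map<String, String> jobToExperiment =
            new ConcurrentHashMap<String, String>();

    // Maintained in the background whenever a job is created under an
    // experiment.
    public void put(String jobId, String experimentId) {
        jobToExperiment.put(jobId, experimentId);
    }

    // One hash lookup instead of four hierarchical queries.
    public String experimentIdFor(String jobId) {
        return jobToExperiment.get(jobId);
    }
}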

2. Assuming we are flexible enough to break away from the hierarchical
object model

Denormalize the object model and come up with a new model that duplicates
data along with indices. For example, the job node might contain references
to all of its parents (task, node, workflow, experiment) within itself. This
would also include improving the data update model to separate frequently
accessed objects from rarely accessed ones.
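
As a rough sketch of such a denormalized job node (hypothetical field names,
not a proposal for the exact schema we would ship):

// Hypothetical denormalized job record: the job duplicates references to
// all of its ancestors, so any of them is reachable in a single lookup.
// The price is duplicated data that must be kept consistent on updates.
public class DenormalizedJob {
    private String jobId;

    // Duplicated parent references (task, node, workflow, experiment).
    private String taskId;
    private String nodeId;
    private String workflowId;
    private String experimentId;

    // Frequently updated state, kept apart from the slow-changing
    // identifiers above so that status writes stay cheap.
    private String status;

    // Getters and setters omitted for brevity.
}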

I'd encourage you to draw two object models:

1. a denormalized object model evolved from the current hierarchical object
model
2. an object model that can directly support the queries that you have

Once we have these two, we can merge them to get the ideal object model,
one which can support most of the existing scenarios.

Also, in the places where the data model changes frequently, try to define
those parts more loosely using name-value pairs rather than a strict schema.
This will not only enable you to expand the model but also to version it
based on the evolution of the model.
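
As a minimal sketch of what I mean, assuming we keep the fast-changing parts
as plain name-value pairs next to a model version tag (names are
illustrative only):

import java.util.HashMap;
import java.util.Map;

// Hypothetical loosely-schemed object: fast-changing attributes live in a
// name-value map instead of fixed columns, and a version tag records which
// evolution of the data model wrote them.
public class LooselyTypedObject {

    private final String objectId;
    private final int modelVersion; // bump as the data model evolves
    private final Map<String, String> properties =
            new HashMap<String, String>();

    public LooselyTypedObject(String objectId, int modelVersion) {
        this.objectId = objectId;
        this.modelVersion = modelVersion;
    }

    // Adding a new attribute needs no schema change, only a new key.
    public void set(String name, String value) {
        properties.put(name, value);
    }

    public String get(String name) {
        return properties.get(name);
    }

    public int getModelVersion() {
        return modelVersion;
    }
}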

Finally, think about putting an API in front of the registry. It can
initially be as simple as "getObjectById(String id):XMLObject/JsonObject".
Later, a second-level API can be introduced to run queries like
"getObjectThroughSecondaryIndex(String secondaryIndexName, String
secondaryIndexValue, String objectName):List<XMLObject/JsonObject>". This
can be used, for example, to retrieve all jobs in a workflow by issuing
"getObjectThroughSecondaryIndex("WorkflowId", <workflow-id>, "JobDetails")".

Let's first pick one of the above options (hierarchical vs. denormalized)
and then dig deep into the use cases mentioned in the mail. Without that,
it's hard to discuss every option.


Thanks,
Eran Chinthaka Withana


On Thu, Mar 6, 2014 at 8:28 PM, Sachith Withana <[email protected]> wrote:

> Hi all,
>
> This is a follow-up discussion of what we had on the “Current Database
> remodeling of Apache Airavata” [1], followed by a successful Google
> Hangout on Air [2].
>
> Prevailing Registry (Registry CPI)
>
> This is the current Data Model design of the Registry [3].
>
> Currently we use a MySQL database abstracted by an OpenJPA layer.
>
> The Registry contains
>
>    - Experiment-related data (refer to the Data Model design [3]):
>      experiment-, application-, and node-level statuses, errors, scheduling
>      and QoS (user-provided) information, and the inputs and outputs of
>      each experiment, node, and application.
>    - Application Catalogs: contain the descriptors (host, application,
>      service).
>    - Gateway Information.
>    - User information (mostly admin users of the gateways).
>
> Problems faced
>
>
> Note: We haven’t done any performance testing on the registry or even
> included the current registry in a release yet.
>
>
>    - Application Catalogs (Descriptors)
>      The current version of the Application Catalogs exists as XML files,
>      and we are storing them in the registry as blobs → we cannot query
>      them.
>    - Data Model Changes
>      The data models are highly hierarchical, which causes a lot of
>      problems when they need to be changed. The Data Models are expected
>      to change heavily through the development phases (0.12, 0.13, …)
>      until we settle on a concrete solution for the production release
>      (1.0).
>      To accommodate even small changes to the model, we need to go through
>      several levels of costly code-level changes due to the current
>      implementation, and this cost recurs because the Data Models keep
>      changing all the time.
>    - Hierarchy causing overhead
>      Since the whole current data model is hierarchical, there is a
>      significant overhead in retrieving data.
>      e.g. to get an Experiment, you need to make multiple queries bottom-up
>      (from the job level to the experiment level: job → task → node →
>      workflow → experiment) to assemble the whole Experiment.
>
>
> Use cases
>
> Here are some typical queries Airavata should support (with respect to the
> gateways that are being integrated with Airavata).
> Some gateways use workflows while others use single job submission.
>
>
>    - *ParamChem* (Workflow oriented)
>       - get the data of each node (of the Workflow)
>          - inputs
>          - outputs
>          - status
>       - get updated node data since the last retrieval (wish)
>
>    - *CIPRES* (Single Job Submission)
>       - get Experiment Summary
>          - metadata
>          - statistics
>             - inputs
>             - parameters
>             - intermediate data
>          - progress
>
>    - Clone an existing experiment (with either different descriptors or
>      inputs)
>    - Store output files (wish)
>
>    - *UltraScan* (Single Job)
>       - get Job-level status (GFac-level status) (it's the second-lowest
>         level of statuses; refer to the Data Model Design [3])
>       - get Application-level statuses (the UltraScan application issues
>         statuses, and we need to get them to the user)
>       - get Output data
>
>    - *CyberGateway* (Single Job Submission)
>       - get a Summary of all Experiments
>          - metadata
>          - status
>          - progress
>
>
> Requirements/Suggestions
>
>    - Here are the Data Persistence Requirements [4]
>    - Application Catalog
>      We need a proper way and a place to store the application catalogs so
>      that they are queryable.
>    - Meta-Data Catalog
>      Our Data Model is highly hierarchical, and the Data Models will keep
>      changing in the development phase (until a production release), so we
>      need to come up with a way to accommodate those hierarchical changes.
>    - Separate out the registry, Data Store, Provenance, etc.
>
> Wish List
>
>    - File Management
>      Meta-data extraction from large files, and storing the extracted
>      meta-data.
>
> Special thanks to Saminda for creating the Data Persistence requirements
> document, and to the whole Airavata team for helping out on this analysis.
>
> [1]
> http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
>
> [2] http://www.youtube.com/watch?v=EY6oPwqi1g4
>
> [3] https://github.com/apache/airavata/tree/master/airavata-api
>
> [4]
> https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
>
> --
> Thanks,
> Sachith Withana
>
