Thanks for the very informative response. Please see my inline comments below.

On Thu, Mar 13, 2014 at 1:27 AM, Eran Chinthaka Withana <[email protected]> wrote:

> Hi,
>
> First, sorry for the late reply and thanks for the detailed message. Here is
> what I think.
>
> Seems like we are getting burnt frequently simply because the data model is
> hierarchical. All the queries mentioned in the use cases require us to
> navigate a significant portion of the model to get to the correct data.
> Assuming performance and scalability are major concerns, here is what I
> think our options are.
>
> 1. Assuming we want to keep our hierarchical model as it is
>
>    1.1 Optimize the traversal of the object model: we can easily get help
>    from memory-resident object graphs (like Titan, Hazelcast or Pregel)
>    backed by a read/write-through persistence layer. This will let you do
>    all sorts of weird in-memory graph traversals.
>
>    1.2 Optimize data retrieval: we can create indexes for each query
>    while maintaining the hierarchical model in the background. For example,
>    an index is created mapping a job to an experiment so that retrieval of
>    the experiment for a given job is easier without hopping through all the
>    nodes.
>
> 2. Assuming we are flexible enough to break away from the hierarchical
>    object model
>
>    Denormalize the object model and come up with a new object model that
>    might duplicate data with indices. For example, the job node might
>    contain a reference to all of its parents (task, node, workflow,
>    experiment) within itself. This will also include improving the data
>    update model to separate frequently accessed objects from not so
>    frequent ones.
>
> I'd encourage you to draw two object models.
>
> 1. a denormed object model evolved from the current hierarchical object
>    model

What would be the purpose of denormalizing the object model? To optimize
the queries by having a flatter structure that supports those queries
directly?

> 2. an object model that can directly support the queries that you have

The current data model supports all the use cases that we have.
But the problems are its low efficiency and that it is not convenient
enough to alter the data model (in terms of implementation).

> Once we have these two, we can merge them to get the ideal object model,
> which can support most of the existing scenarios.
>
> Finally, in the places where the data model changes frequently, try to
> define those more loosely using name-value pairs rather than using a strict
> schema. This will not only enable you to expand the model but also to
> version based on the evolution of the model.
>
> Finally, think about putting an API in front of the registry. It can
> initially be as simple as "getObjectById(String id):XMLObject/JsonObject".
> But later, a second-level API can be introduced to do queries like
> "getObjectThroughSecondaryIndex(String secondaryIndexName, String
> secondaryIndexValue, String objectName):List<XMLObject/JsonObject>". This
> can be used to do things like retrieving all jobs in a workflow by issuing
> "getObjectThroughSecondaryIndex("WorkflowId", <workflow-id>, "JobDetails")"

We do have an interface in front of the registry/database (called the
Registry CPI), and it uses key-value pairs in storing data.
What did you mean by "version based evolution of the model"?

> Let's first pick one of the above options (hierarchical vs. denormed) and
> then dig deep into the use cases mentioned in the mail. Without that, it's
> hard to discuss with every option.
>
> Thanks,
> Eran Chinthaka Withana
>
> On Thu, Mar 6, 2014 at 8:28 PM, Sachith Withana <[email protected]> wrote:
>
> > Hi all,
> >
> > This is a follow-up discussion of what we had on the “Current Database
> > remodeling of Apache Airavata” [1], followed by a successful Google on
> > air hangout. [2]
> >
> > Prevailing Registry (Registry CPI)
> >
> > This is the current Data Model design of the Registry. [3]
> >
> > Currently we use a MySQL database abstracted by an OpenJPA layer.
> >
> > The Registry contains
> >
> >    - Experiment-related data (refer to the Data Model design: [3])
> >      includes experiment, application, node level statuses, errors,
> >      Scheduling and QOS (user provided) information, Inputs and outputs
> >      of each experiment, node and the application.
> >    - Application Catalogs
> >      contains the descriptors (host, application, service)
> >    - Gateway Information
> >    - User information (mostly admin users of the gateways)
> >
> > Problems faced
> >
> > Note: We haven’t done any performance testing on the registry or even
> > included the current registry in a release yet.
> >
> >    - Application Catalogs (Descriptors)
> >      The current version of the Application Catalogs is used as XML
> >      files. We are storing them in the registry as blobs → we cannot
> >      query them.
> >    - Data Model Changes
> >      The data models are highly hierarchical. This causes a lot of
> >      problems when the Data Models need to be changed.
> >      Data Models are expected to change heavily within the development
> >      phases (0.12, 0.13…) until we settle on a concrete solution for the
> >      production release (1.0).
> >      To accommodate even the small changes of the model, we need to go
> >      through several levels of costly code level changes due to the
> >      current implementation.
> >      It can be costly since the Data Models keep changing all the time.
> >    - Hierarchy causing overhead
> >      Since the whole current data model is hierarchical, there is a
> >      significant overhead in retrieving data.
> >      ex: To get an Experiment, you need to make multiple queries from
> >      the bottom up (from the job level to the experiment level (job →
> >      task → node → workflow → experiment)) to get the whole Experiment.
> >
> > Use cases
> >
> > Here are some typical queries Airavata should support (with respect to
> > the gateways that are being integrated with Airavata).
> > Some gateways use workflows while the others use single job submission.
> >
> >    - *ParamChem* (Workflow oriented)
> >       - get the data of each node (of the Workflow)
> >          - inputs
> >          - outputs
> >          - status
> >       - get updated node data since last retrieval (wish)
> >
> >    - *CIPRES* (Single Job Submission)
> >       - get Experiment Summary
> >          - metadata
> >          - statistics
> >          - inputs
> >          - parameters
> >          - intermediate data
> >          - progress
> >       - Clone an existing experiment (with either different descriptors
> >         or inputs)
> >       - Store output files (wish)
> >
> >    - *UltraScan* (Single Job)
> >       - get Job level status (Gfac level status) (it’s the second lowest
> >         level of statuses, refer to the Data Model Design [3])
> >       - get Application Level Statuses (the UltraScan application issues
> >         statuses, we need to get them to the user)
> >       - get Output data
> >
> >    - *CyberGateway* (Single Job Submission)
> >       - get Summary of all Experiments
> >          - metadata
> >          - status
> >          - progress
> >
> > Requirements/Suggestions
> >
> >    - Here are the Data Persistence Requirements [4]
> >    - Application Catalog
> >      A proper way and place to store the application catalogs so that
> >      they can be queryable
> >    - Meta-Data Catalog
> >      Our Data Model is highly hierarchical.
> >      Since the Data Models will keep changing in the development phase
> >      (until a production release), we need to come up with a way to make
> >      it facilitate the hierarchical changes
> >    - Separate out the registry, Data Store, Provenance ...etc
> >
> > Wish List
> >
> >    - File Management
> >      Meta Data extraction from large files and store them
> >
> > Special Thanks to Saminda for creating the Data Persistent requirements
> > document and the whole Airavata team for helping out on this analysis.
> >
> > [1]
> > http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
> >
> > [2]
> > http://www.youtube.com/watch?v=EY6oPwqi1g4
> >
> > [3]
> > https://github.com/apache/airavata/tree/master/airavata-api
> >
> > [4]
> > https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
> >
> > --
> > Thanks,
> > Sachith Withana

--
Thanks,
Sachith Withana
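
For reference, the two-level registry API and the denormalized job record
discussed in the thread could look roughly like the sketch below. It is only
an illustration under stated assumptions: the Registry class, its in-memory
dictionaries, the put() helper, and the record field names are hypothetical
and are not the Registry CPI or any existing Airavata code. The two lookup
methods mirror the getObjectById and getObjectThroughSecondaryIndex
signatures quoted above, and each job record carries the ids of its parents
so that no hop through task, node and workflow is needed.

```python
class Registry:
    """Hypothetical in-memory registry: primary lookup by id plus named
    secondary indexes. Illustrative only; updates and deletes are not handled."""

    def __init__(self):
        # object id -> record (a plain dict of name-value pairs)
        self._objects = {}
        # (index name, index value, object name) -> list of object ids
        self._secondary = {}

    def put(self, object_id, object_name, record, indexed_fields=()):
        """Store a record and index it under each of the requested fields."""
        self._objects[object_id] = record
        for field in indexed_fields:
            key = (field, record[field], object_name)
            self._secondary.setdefault(key, []).append(object_id)

    def get_object_by_id(self, object_id):
        """First-level API: fetch one record by its primary id."""
        return self._objects[object_id]

    def get_object_through_secondary_index(self, index_name, index_value, object_name):
        """Second-level API: fetch all records of a kind matching an index value."""
        ids = self._secondary.get((index_name, index_value, object_name), [])
        return [self._objects[i] for i in ids]


def make_job(job_id, task_id, node_id, workflow_id, experiment_id, status):
    """Denormalized job record: it holds references to all of its parents."""
    return {
        "JobId": job_id,
        "TaskId": task_id,
        "NodeId": node_id,
        "WorkflowId": workflow_id,
        "ExperimentId": experiment_id,
        "Status": status,
    }


registry = Registry()
for job in (
    make_job("job-1", "task-1", "node-1", "wf-1", "exp-1", "RUNNING"),
    make_job("job-2", "task-2", "node-2", "wf-1", "exp-1", "QUEUED"),
    make_job("job-3", "task-3", "node-3", "wf-2", "exp-2", "DONE"),
):
    registry.put(job["JobId"], "JobDetails", job,
                 indexed_fields=("WorkflowId", "ExperimentId"))

# All jobs in a workflow, via the secondary index.
jobs_in_wf1 = registry.get_object_through_secondary_index(
    "WorkflowId", "wf-1", "JobDetails")

# Experiment for a given job, read directly from the denormalized record.
experiment_of_job3 = registry.get_object_by_id("job-3")["ExperimentId"]
```

The trade-off is the one raised in the thread: parent ids are duplicated in
every job record, and each index has to be kept in sync on writes, in
exchange for single-lookup reads.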
