I have a little request: can someone please explain in more detail why the data model changes are hard, with pointers to the code?
Marlon

On 3/17/14 3:30 AM, Sachith Withana wrote:
> Thanks for the very informative response.
>
> Please read my inline comments.
>
> On Thu, Mar 13, 2014 at 1:27 AM, Eran Chinthaka Withana <
> [email protected]> wrote:
>
>> Hi,
>>
>> First, sorry for the late reply, and thanks for the detailed message. Here is
>> what I think.
>>
>> It seems like we are getting burnt frequently simply because the data model is
>> hierarchical. All the queries mentioned in the use cases require us to
>> navigate a significant portion of the model to get to the correct data.
>> Assuming performance and scalability are major concerns, here is what I
>> think our options are.
>>
>> 1. Assuming we want to keep our hierarchical model as it is
>>
>>    1.1 Optimize the traversal of the object model: we can easily get help
>>    from memory-resident object graphs (like Titan, Hazelcast or Pregel) backed
>>    by a read/write-through persistence layer. This will let you do all sorts
>>    of in-memory graph traversals.
>>    1.2 Optimize data retrieval: we can create indexes for each query while
>>    maintaining the hierarchical model in the background. For example, an
>>    index is created mapping jobs to an experiment, so that retrieving the
>>    experiment for a given job is easy without hopping through all the nodes.
>>
>> 2. Assuming we are flexible enough to break away from the hierarchical object
>> model
>>
>>    Denormalize the object model and come up with a new object model that might
>>    duplicate data, with indices. For example, the job node might contain
>>    references to all of its parents (task, node, workflow, experiment) within
>>    itself. This will also include improving the data update model to separate
>>    frequently accessed objects from not-so-frequent ones.
>>
>> I'd encourage you to draw two object models:
>>
>> 1. a denormalized object model evolved from the current hierarchical object
>> model
>>
> What should the purpose be when denormalizing the object model? To optimize
> the queries by having a flatter structure that supports those queries directly?
>
>> 2. an object model that can directly support the queries that you have
>>
> The current data model supports all the use cases that we have. But the
> problems are the low efficiency and it not being convenient enough to alter
> the data model (in terms of implementation).
>
>> Once we have these two, we can merge them to get the ideal object model,
>> which can support most of the existing scenarios.
>>
>> Finally, in the places where the data model changes frequently, try to define
>> those more loosely using name-value pairs rather than a strict schema. This
>> will not only enable you to expand the model but also to version it based
>> on the evolution of the model.
>>
>> Finally, think about putting an API in front of the registry. It can initially
>> be as simple as "getObjectById(String id):XMLObject/JsonObject". But later,
>> a second-level API can be introduced to do queries like
>> "getObjectThroughSecondaryIndex(String secondaryIndexName, String
>> secondaryIndexValue, String objectName):List<XMLObject/JsonObject>". This
>> can be used to do things like retrieving all jobs in a workflow by issuing
>> getObjectThroughSecondaryIndex("WorkflowId", <workflow-id>, "JobDetails").
>>
> We do have an interface in front of the registry/database (called the
> Registry CPI) and it uses key-value pairs for storing data.
> What did you mean by "version based evolution of the model"?
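To make the indexing (1.2) and denormalization (2) suggestions above a bit more concrete, here is a minimal, self-contained Java sketch. JobRecord, its field names and the in-memory map are all made-up illustrations, not types from the Airavata code base; a real version would sit behind the Registry CPI and a persistence layer rather than a HashMap.

    import java.util.*;

    // Hypothetical denormalized job record: the job carries references to all of
    // its ancestors (task, node, workflow, experiment), so no upward traversal of
    // the hierarchy is needed to answer "which experiment owns this job?".
    public class JobRecord {
        final String jobId, taskId, nodeId, workflowId, experimentId;

        JobRecord(String jobId, String taskId, String nodeId,
                  String workflowId, String experimentId) {
            this.jobId = jobId;
            this.taskId = taskId;
            this.nodeId = nodeId;
            this.workflowId = workflowId;
            this.experimentId = experimentId;
        }

        public static void main(String[] args) {
            // Secondary index: experimentId -> jobs. It can be maintained next to
            // the hierarchical model (option 1.2) or derived directly from the
            // denormalized records themselves (option 2).
            Map<String, List<JobRecord>> jobsByExperiment = new HashMap<>();

            JobRecord job = new JobRecord("job-1", "task-1", "node-1", "wf-1", "exp-1");
            jobsByExperiment.computeIfAbsent(job.experimentId, k -> new ArrayList<>()).add(job);

            // One indexed lookup instead of job -> task -> node -> workflow -> experiment hops.
            System.out.println("Jobs of exp-1: " + jobsByExperiment.get("exp-1").size());
        }
    }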
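The two-level API Eran describes could look roughly like the interface below. The method names and parameters follow his mail; the interface name RegistryFacade and the JsonObject placeholder are assumptions, not existing Airavata types.

    import java.util.List;

    // Rough sketch of a registry-facing API; RegistryFacade and JsonObject are
    // placeholder names only.
    public interface RegistryFacade {

        // First-level API: fetch a single object by its primary id.
        JsonObject getObjectById(String id);

        // Second-level API: fetch objects through a secondary index, e.g.
        //   getObjectThroughSecondaryIndex("WorkflowId", workflowId, "JobDetails")
        // to list every job that belongs to a given workflow.
        List<JsonObject> getObjectThroughSecondaryIndex(String secondaryIndexName,
                                                        String secondaryIndexValue,
                                                        String objectName);

        // Minimal placeholder for the XML/JSON payload returned to callers.
        interface JsonObject {
            String asString();
        }
    }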
>
>> Let's first pick one of the above options (hierarchical vs. denormalized) and
>> then dig deep into the use cases mentioned in the mail. Without that, it's
>> hard to discuss every option.
>>
>> Thanks,
>> Eran Chinthaka Withana
>>
>> On Thu, Mar 6, 2014 at 8:28 PM, Sachith Withana <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is a follow-up discussion of what we had on the “Current Database
>>> remodeling of Apache Airavata” [1], followed by a successful Google on air
>>> hangout. [2]
>>>
>>> Prevailing Registry (Registry CPI)
>>>
>>> This is the current Data Model design of the Registry. [3]
>>>
>>> Currently we use a MySQL database abstracted by an OpenJPA layer.
>>>
>>> The Registry contains:
>>>
>>> - Experiment-related data (refer to the Data Model design [3]):
>>>   experiment, application and node-level statuses, errors, scheduling and
>>>   QoS (user-provided) information, and the inputs and outputs of each
>>>   experiment, node and application
>>> - Application Catalogs:
>>>   contain the descriptors (host, application, service)
>>> - Gateway information
>>> - User information (mostly admin users of the gateways)
>>>
>>> Problems faced
>>>
>>> Note: We haven’t done any performance testing on the registry or even
>>> included the current registry in a release yet.
>>>
>>> - Application Catalogs (Descriptors)
>>>   The current Application Catalog descriptors are XML files. We are storing
>>>   them in the registry as blobs → we cannot query them.
>>> - Data Model changes
>>>   The data models are highly hierarchical. This causes a lot of problems
>>>   when the data models need to be changed.
>>>   Data models are expected to change heavily within the development phases
>>>   (0.12, 0.13, …) until we settle on a concrete solution for the production
>>>   release (1.0).
>>>   To accommodate even small changes to the model, we need to go through
>>>   several levels of costly code-level changes due to the current
>>>   implementation. This is expensive because the data models keep changing
>>>   all the time.
>>> - Hierarchy causing overhead
>>>   Since the whole current data model is hierarchical, there is a significant
>>>   overhead in retrieving data.
>>>   e.g.: to get an Experiment, you need to make multiple queries bottom-up
>>>   (from the job level to the experiment level: job → task → node →
>>>   workflow → experiment) to assemble the whole Experiment.
>>>
>>> Use cases
>>>
>>> Here are some typical queries Airavata should support (with respect to the
>>> gateways that are being integrated with Airavata).
>>> Some gateways use workflows while the others use single job submission.
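As a toy illustration of the "hierarchy causing overhead" point above, before the per-gateway use cases that follow: resolving the experiment that owns a job costs one dependent lookup per level. The maps and method below are invented for illustration; in the real registry these would be separate JPA queries or joins.

    import java.util.HashMap;
    import java.util.Map;

    // Every level only knows its direct parent, so finding the owning experiment
    // of a job forces a bottom-up walk. All names here are made up.
    public class HierarchyWalk {
        static Map<String, String> taskOfJob = new HashMap<>();
        static Map<String, String> nodeOfTask = new HashMap<>();
        static Map<String, String> workflowOfNode = new HashMap<>();
        static Map<String, String> experimentOfWorkflow = new HashMap<>();

        // Four dependent lookups (in the real registry, four queries/joins)
        // just to locate the experiment of one job.
        static String experimentOfJob(String jobId) {
            String taskId = taskOfJob.get(jobId);
            String nodeId = nodeOfTask.get(taskId);
            String workflowId = workflowOfNode.get(nodeId);
            return experimentOfWorkflow.get(workflowId);
        }

        public static void main(String[] args) {
            taskOfJob.put("job-1", "task-1");
            nodeOfTask.put("task-1", "node-1");
            workflowOfNode.put("node-1", "wf-1");
            experimentOfWorkflow.put("wf-1", "exp-1");
            System.out.println(experimentOfJob("job-1")); // prints exp-1
        }
    }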
>>>
>>> - *ParamChem* (workflow-oriented)
>>>   - get the data of each node (of the workflow)
>>>     - inputs
>>>     - outputs
>>>     - status
>>>   - get updated node data since the last retrieval (wish)
>>>
>>> - *CIPRES* (single job submission)
>>>   - get experiment summary
>>>     - metadata
>>>     - statistics
>>>     - inputs
>>>     - parameters
>>>     - intermediate data
>>>     - progress
>>>   - clone an existing experiment (with either different descriptors or
>>>     inputs)
>>>   - store output files (wish)
>>>
>>> - *UltraScan* (single job)
>>>   - get job-level status (GFac-level status) (it’s the second-lowest level
>>>     of statuses; refer to the Data Model design [3])
>>>   - get application-level statuses (the UltraScan application issues
>>>     statuses; we need to get them to the user)
>>>   - get output data
>>>
>>> - *CyberGateway* (single job submission)
>>>   - get a summary of all experiments
>>>     - metadata
>>>     - status
>>>     - progress
>>>
>>> Requirements/Suggestions
>>>
>>> - Here are the Data Persistence Requirements [4]
>>> - Application Catalog
>>>   a proper way and place to store the application catalogs so that they are
>>>   queryable
>>> - Meta-Data Catalog
>>>   Our data model is highly hierarchical. Since the data models will keep
>>>   changing in the development phase (until a production release), we need to
>>>   come up with a way to accommodate changes to the hierarchy.
>>> - Separate out the registry, data store, provenance, etc.
>>>
>>> Wish list
>>>
>>> - File management
>>>   extract metadata from large files and store it
>>>
>>> Special thanks to Saminda for creating the Data Persistence requirements
>>> document, and to the whole Airavata team for helping out with this analysis.
>>>
>>> [1] http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
>>> [2] http://www.youtube.com/watch?v=EY6oPwqi1g4
>>> [3] https://github.com/apache/airavata/tree/master/airavata-api
>>> [4] https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
>>>
>>> --
>>> Thanks,
>>> Sachith Withana
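Pulling the gateway use cases in the thread above together, Eran's "object model that can directly support the queries" could be approached from the query side first, roughly as sketched below. None of these method names exist in Airavata; they only illustrate the shape such an API might take, with plain strings and maps standing in for real result types.

    import java.util.List;
    import java.util.Map;

    // Hypothetical query-first view of the registry, shaped by the gateway
    // use cases (ParamChem, CIPRES, UltraScan, CyberGateway). Illustrative only.
    public interface GatewayQueries {

        // CIPRES / CyberGateway: summary of one or all experiments
        // (metadata, status, progress, inputs, ...).
        Map<String, String> getExperimentSummary(String experimentId);
        List<Map<String, String>> getAllExperimentSummaries(String gatewayId);

        // ParamChem: per-node data of a workflow (inputs, outputs, status).
        Map<String, String> getNodeData(String workflowId, String nodeId);

        // UltraScan: job-level (GFac) and application-level statuses.
        String getJobStatus(String jobId);
        String getApplicationStatus(String jobId);

        // CIPRES: clone an existing experiment with different inputs or descriptors.
        String cloneExperiment(String experimentId, Map<String, String> newInputs);
    }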
