I have a little request: can someone please explain in more detail why
the data model changes are hard, with pointers to the code?

Marlon

On 3/17/14 3:30 AM, Sachith Withana wrote:
> Thanks for the very informative response.
>
> Please read my inline comments.
>
>
> On Thu, Mar 13, 2014 at 1:27 AM, Eran Chinthaka Withana <
> [email protected]> wrote:
>
>> Hi,
>>
>> First, sorry for the late reply, and thanks for the detailed message. Here
>> is what I think.
>>
>> Seems like we are getting burnt frequently simply because the data model is
>> hierarchical. All the queries mentioned in the use cases require us to
>> navigate a significant portion of the model to get to the correct data.
>> Assuming performance and scalability are major concerns, here is what I
>> think our options are.
>>
>>
>> 1. Assuming we want to keep our hierarchical model as it is
>>
>>       1.1 Optimize the traversal of the object model: we can easily get
>> help from memory-resident object graphs (like Titan, Hazelcast, or Pregel)
>> backed by a read/write-through persistence layer. This will allow all sorts
>> of in-memory graph traversals.
>>       1.2 Optimize data retrieval: we can create indexes for each query
>> while maintaining the hierarchical model in the background. For example, an
>> index mapping jobs to experiments lets us retrieve the experiment for a
>> given job without hopping through all the intermediate nodes (see the
>> sketch after option 2 below).
>>
>> 2. Assuming we are flexible enough to break away from the hierarchical
>> object model
>>
>> Denormalize the object model and come up with a new object model that might
>> duplicate data alongside indices. For example, the job node might contain
>> references to all of its parents (task, node, workflow, experiment) within
>> itself. This will also include improving the data update model to separate
>> frequently accessed objects from rarely accessed ones.
>>
>> I'd encourage you to draw two object models.
>>
>> 1. a denormalized object model evolved from the current hierarchical
>> object model
>>
> What should be the purpose of denormalizing the object model? To optimize
> the queries by having a flatter structure that supports those queries directly?
>
>
>> 2. an object model that can directly support the queries that you have
>>
> The current data model supports all the use cases that we have. But the
> problems are low efficiency and the inconvenience of altering the data
> model (in terms of implementation).
>
>
>> Once we have these two, we can merge them to get the ideal object model
>> which can support most of the existing scenarios.
>>
>> Finally, in the places where the data model changes frequently, try to
>> define those parts more loosely using name-value pairs rather than a strict
>> schema. This will not only enable you to expand the model but also to
>> version it based on the evolution of the model.
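>>
>> As a sketch of what I mean (again with made-up names), the volatile part of
>> an object becomes a bag of name-value pairs tagged with a model version, so
>> adding a field is a data change rather than a schema change:
>>
>>     import java.util.HashMap;
>>     import java.util.Map;
>>
>>     // Stable fields stay as strict columns; everything still churning
>>     // lives in loosely-typed properties. The modelVersion records which
>>     // evolution of the data model wrote this record.
>>     class ExperimentRecord {
>>         String experimentId;                               // strict schema
>>         int modelVersion;                                  // model evolution
>>         Map<String, String> properties = new HashMap<>();  // loose schema
>>
>>         void set(String name, String value) { properties.put(name, value); }
>>         String get(String name)             { return properties.get(name); }
>>     }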
>>
>> Finally, think about putting an API in front of the registry. It can initially
>> be as simple as "getObjectById(String id):XMLObject/JsonObject". But later,
>> a second level API can be introduced to do queries like
>> "getObjectThroughSecondaryIndex(String secondaryIndexName, String
>> secondaryIndexValue, String objectName):List<XMLObject/JsonObject>". This
>> can be used to do things like retrieving all jobs in a workflow by issuing
>> "getObjectThroughSecondaryIndex("WorkflowId", <workflow-id>, "JobDetails")"
>>
>>
> We do have an interface in front of the registry/database (called the
> Registry CPI), and it uses key-value pairs for storing data.
> What did you mean by "versioning based on the evolution of the model"?
>
>
>> Let's first pick one of the above options (hierarchical vs. denormalized)
>> and then dig deep into the use cases mentioned in the mail. Without that,
>> it's hard to discuss each option.
>>
>>
>> Thanks,
>> Eran Chinthaka Withana
>>
>>
>> On Thu, Mar 6, 2014 at 8:28 PM, Sachith Withana <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is a follow-up discussion of what we had on the “Current Database
>>> remodeling of Apache Airavata” [1], followed by a successful Google On Air
>>> hangout [2].
>>>
>>> Prevailing Registry (Registry CPI)
>>>
>>> This is the current Data Model design of the Registry. [3]
>>>
>>> Currently we use a MySQL database abstracted by an OpenJPA layer.
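>>>
>>> For context, the abstraction is standard JPA; a registry entity looks
>>> roughly like this (illustrative names, not the real schema):
>>>
>>>     import javax.persistence.Entity;
>>>     import javax.persistence.Id;
>>>
>>>     // Minimal shape of a registry entity as mapped by OpenJPA. Every
>>>     // data model change ripples through classes like this, the matching
>>>     // MySQL table, and the query code written against them.
>>>     @Entity
>>>     public class ExperimentMetadata {
>>>         @Id
>>>         private String experimentId;
>>>         private String gatewayId;
>>>         private String description;
>>>         // getters and setters omitted for brevity
>>>     }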
>>>
>>> The Registry contains
>>>
>>>    - Experiment-related data (refer to the Data Model design [3]):
>>>      includes experiment, application, and node-level statuses, errors,
>>>      Scheduling and QoS (user-provided) information, and the inputs and
>>>      outputs of each experiment, node, and application
>>>    - Application Catalogs: contain the descriptors (host, application,
>>>      service)
>>>    - Gateway information
>>>    - User information (mostly admin users of the gateways)
>>>
>>> Problems faced
>>>
>>>
>>> Note: We haven’t done any performance testing on the registry or even
>>> included the current registry in a release yet.
>>>
>>>
>>>    - Application Catalogs (Descriptors)
>>>      The current Application Catalogs are XML files, and we store them in
>>>      the registry as blobs → we cannot query them.
>>>    - Data Model Changes
>>>      The data models are highly hierarchical, which causes a lot of
>>>      problems when they need to be changed. The data models are expected
>>>      to change heavily during the development releases (0.12, 0.13, …)
>>>      until we settle on a concrete solution for the production release
>>>      (1.0). To accommodate even small changes to the model, we have to
>>>      make several levels of costly code-level changes due to the current
>>>      implementation, and that cost recurs because the data models keep
>>>      changing.
>>>    - Hierarchy causing overhead
>>>      Since the whole current data model is hierarchical, there is
>>>      significant overhead in retrieving data.
>>>      ex: To get an Experiment, you have to make multiple queries from the
>>>      bottom up (from the job level to the experiment level: job → task →
>>>      node → workflow → experiment), as sketched below.
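>>>
>>> A minimal sketch of that bottom-up walk (class and method names are
>>> hypothetical; the point is one store lookup per level):
>>>
>>>     // Each level references only its direct parent, so resolving
>>>     // the experiment for a job costs four separate queries.
>>>     class Job      { String id; String taskId; }
>>>     class Task     { String id; String nodeId; }
>>>     class Node     { String id; String workflowId; }
>>>     class Workflow { String id; String experimentId; }
>>>
>>>     // Stand-in for the registry's per-entity lookups.
>>>     interface Store {
>>>         Job getJob(String id);
>>>         Task getTask(String id);
>>>         Node getNode(String id);
>>>         Workflow getWorkflow(String id);
>>>     }
>>>
>>>     class HierarchyWalk {
>>>         static String experimentOf(Store s, String jobId) {
>>>             Job j      = s.getJob(jobId);              // query 1
>>>             Task t     = s.getTask(j.taskId);          // query 2
>>>             Node n     = s.getNode(t.nodeId);          // query 3
>>>             Workflow w = s.getWorkflow(n.workflowId);  // query 4
>>>             return w.experimentId;
>>>         }
>>>     }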
>>>
>>>
>>> Use cases
>>>
>>> Here are some typical queries Airavata should support (with respect to the
>>> gateways that are being integrated with Airavata).
>>> Some gateways use workflows while the others use single job submission.
>>>
>>>
>>>    - *ParamChem* (Workflow oriented)
>>>       - get the data of each node (of the Workflow)
>>>          - inputs
>>>          - outputs
>>>          - status
>>>       - get updated node data since last retrieval (wish)
>>>
>>>
>>>
>>>    - *CIPRES* (Single Job Submission)
>>>       - get Experiment Summary
>>>          - metadata
>>>          - statistics
>>>             - inputs
>>>             - parameters
>>>             - intermediate data
>>>          - progress
>>>    - Clone an existing experiment (with either different descriptors or
>>>      inputs)
>>>    - Store output files (wish)
>>>
>>>
>>>
>>>    - *UltraScan* (Single Job)
>>>       - get Job-level status (GFac-level status; it's the second-lowest
>>>         level of statuses, refer to the Data Model design [3])
>>>       - get Application-level statuses (the UltraScan application issues
>>>         statuses; we need to get them to the user)
>>>       - get Output data
>>>    - *CyberGateway* (Single Job Submission)
>>>       - get Summary of all Experiments
>>>          - metadata
>>>          - status
>>>          - progress
>>>
>>> Requirements/Suggestions
>>>
>>>    - Here are the Data Persistence Requirements [4]
>>>    - Application Catalog
>>>      a proper way and place to store the application catalogs so that
>>>      they can be queried
>>>    - Meta-Data Catalog
>>>      Our data model is highly hierarchical. Since the data models will
>>>      keep changing in the development phase (until a production release),
>>>      we need to come up with a way to accommodate those hierarchical
>>>      changes
>>>    - Separate out the registry, Data Store, Provenance, etc.
>>>
>>>
>>> Wish List
>>>
>>>    - File Management
>>>      extract metadata from large files and store it
>>>
>>>
>>> Special thanks to Saminda for creating the Data Persistence requirements
>>> document, and to the whole Airavata team for helping out with this analysis.
>>>
>>> [1] http://markmail.org/thread/33bwjmgs75um46uc#query:+page:1+mid:4lguliiktjohjmsd+state:results
>>> [2] http://www.youtube.com/watch?v=EY6oPwqi1g4
>>> [3] https://github.com/apache/airavata/tree/master/airavata-api
>>> [4] https://docs.google.com/document/d/1yhUlwq5Q3WNMAan3cdpKYVT2AJsIL3VAEicdRilskRw
>>> --
>>> Thanks,
>>> Sachith Withana
>>>
>
>
