Here's how I've written the script to do it.

Experiments loaded: 10 users, 4 projects per user, and 1,000 to 100,000 experiments per user (runs at 1,000, 10,000, and 100,000), covering applications such as Echo and Amber.
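Roughly, the loader looks like the sketch below. This is a sketch only: the actual registry client call is left as a placeholder comment, since I'm not reproducing the exact Airavata client method names and signatures here. The naming scheme follows Marlon's prototype-user suggestion quoted below.

    import java.util.Arrays;
    import java.util.List;

    // Sketch of the data loader. The registry client call near the bottom is a
    // placeholder; printing the generated rows instead keeps the sketch
    // self-contained and runnable.
    public class RegistryLoader {

        private static final List<String> APPS = Arrays.asList("Echo", "Amber");

        public static void main(String[] args) {
            // 1000, 10000, or 100000 experiments per user.
            int experimentsPerUser = args.length > 0 ? Integer.parseInt(args[0]) : 1000;

            for (int u = 1; u <= 10; u++) {
                String user = "user" + u;
                for (int e = 1; e <= experimentsPerUser; e++) {
                    // Spread experiments evenly across 4 projects per user.
                    String project = user + "_project" + (e % 4 + 1);
                    String app = APPS.get(e % APPS.size());
                    // Unique, human-readable, searchable name and description.
                    String name = app + " job " + e + " for " + user;
                    String description = "Description of " + name;
                    // airavataClient.createExperiment(...) would go here.
                    System.out.printf("%s | %s | %s | %s | %s%n",
                            user, project, app, name, description);
                }
            }
        }
    }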
Methods tested:

getExperiment()
searchExperimentByName
searchExperimentByApplication
searchExperimentByDescription

WDYT?
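For reference, the timing side of the script is a small harness along these lines (again a sketch only; each Runnable body would wrap one of the client calls listed above):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the timing harness. The Runnable bodies are left empty here so
    // the sketch compiles on its own; in the real script they wrap the client
    // calls named in the map keys.
    public class RegistryBenchmark {

        static long timeMillis(Runnable op, int iterations) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                op.run();
            }
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) {
            Map<String, Runnable> ops = new LinkedHashMap<>();
            ops.put("getExperiment", () -> { /* client.getExperiment(experimentId) */ });
            ops.put("searchExperimentByName", () -> { /* wrap client call */ });
            ops.put("searchExperimentByApplication", () -> { /* wrap client call */ });
            ops.put("searchExperimentByDescription", () -> { /* wrap client call */ });

            for (Map.Entry<String, Runnable> op : ops.entrySet()) {
                // Repeated calls make the OpenJPA cache warm-up visible only in
                // the first run, so steady-state numbers can be reported separately.
                System.out.printf("%s: %d ms for 100 calls%n",
                        op.getKey(), timeMillis(op.getValue(), 100));
            }
        }
    }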
On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <[email protected]> wrote:

> You can start with the API search functions that we have now: by name, by
> application, and by description.
>
> Marlon
>
> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>
>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <[email protected]> wrote:
>>
>>> A single user may have O(100) to O(1000) experiments, so 10K is too small
>>> as an upper bound on the registry for many users.
>>
>> +1
>>
>> I agree with Marlon. We have the most basic search methods, but the
>> reality is that we need search criteria like Marlon suggests, and I am
>> sure content-based search will be pretty slow with a large number of
>> experiments. So we would have to use a search platform like Solr to
>> improve the performance.
>>
>> I think you can first do the performance test without content-based
>> search; then we can implement that feature and do the performance
>> analysis. If it's too bad (more likely), we can integrate a search
>> platform to improve the performance.
>>
>> Lahiru
>>
>>> We should really test until things break. A plot implying infinite
>>> scaling (by extrapolation) is not useful. A plot showing OK scaling up
>>> to a certain point before things decay is useful.
>>>
>>> I suggest you write up a set of experiments more carefully, starting
>>> with Lahiru's suggestion. How many users? How many experiments per user?
>>> What kinds of searches? Probably the most common will be "get all my
>>> experiments that match this string", "get all experiments that have
>>> state FAILED", and "get all my experiments from the last 30 days". But
>>> the API may not have the latter two yet.
>>>
>>> So to start, you should specify a prototype user. For example, each user
>>> will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs, etc. Each
>>> user will have a unique but human-readable name (user1, user2, ...).
>>> Each experiment will have a unique, human-readable description ("AMBER
>>> job 1 for user 1", "AMBER job 2 for user 1", ...) that is suitable for
>>> searching.
>>>
>>> Post these details first, and then you can create experiment registries
>>> of any size via scripts. Each experiment is different but suitable for
>>> pattern searching.
>>>
>>> This is 10 minutes' worth of thought while waiting for my tea to brew,
>>> so hopefully this is the right start, but I encourage you not to take
>>> this as fixed instructions.
>>>
>>> Marlon
>>>
>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>
>>>> Hi Sachith,
>>>>
>>>> How did you test this? What database did you use?
>>>>
>>>> I think 1000 experiments is a very low number. I think the most
>>>> important part is when there are a large number of experiments: how
>>>> expensive is the search, and how expensive is a single experiment
>>>> retrieval?
>>>>
>>>> If we support getting a defined number of experiments in the API (I
>>>> think this is the practical scenario: among 10k experiments, get 100),
>>>> we have to test the performance of that too.
>>>>
>>>> Regards,
>>>> Lahiru
>>>>
>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm testing the registry with 10, 1,000, and 10,000 experiments, and
>>>>> I've tested the database performance by executing the
>>>>> getAllExperiments method. I'll post the complete analysis.
>>>>>
>>>>> What are the other methods that I should test with?
>>>>>
>>>>> getExperiment(experiment_id)
>>>>> searchExperiment
>>>>>
>>>>> Any pointers?
>>>>>
>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Sachith. Did you look at scaling also? That is, will the
>>>>>> operations below still be the slowest if the DB is 10x, 100x, 1000x
>>>>>> bigger?
>>>>>>
>>>>>> Marlon
>>>>>>
>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>
>>>>>>> I looked into the database operations, and I've listed the
>>>>>>> operations that take the most time:
>>>>>>>
>>>>>>> 1. Getting the status of an experiment (takes around 10% of the
>>>>>>> overall time spent). It has to go through the hierarchy of the data
>>>>>>> model to get to the actual experiment status (node, tasks, etc.).
>>>>>>>
>>>>>>> 2. Dealing with the application inputs. Strangely, the queries
>>>>>>> regarding the ApplicationInputs take a long time to complete. This
>>>>>>> is part of the new application catalog.
>>>>>>>
>>>>>>> 3. Getting all the experiments (using the * wildcard). This takes
>>>>>>> the maximum amount of time when first queried, but thanks to the
>>>>>>> OpenJPA caching, it flattens out as we keep querying.
>>>>>>>
>>>>>>> To address the first issue, I would suggest having a separate table
>>>>>>> for experiment summaries, where the status (both the state and the
>>>>>>> state-update time) would be the only varying entity, and using that
>>>>>>> to improve the query time for experiment summaries.
>>>>>>>
>>>>>>> It would also help improve the performance of getting all the
>>>>>>> experiments (experiment summaries).
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> ToDos: look into memory consumption (in terms of memory leakage,
>>>>>>> etc.).
>>>>>>>
>>>>>>> Any more suggestions?
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Sachith Withana
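To make the experiment-summary table idea from the quoted thread concrete, here is a minimal sketch of what such an entity might look like with the OpenJPA annotations the registry already uses. The entity, table, and column names are illustrative assumptions, not the current registry schema:

    import java.sql.Timestamp;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Sketch of a denormalized summary row: one row per experiment, updated in
    // place on each status change, so summary and listing queries never have to
    // walk the experiment -> node -> task hierarchy.
    @Entity
    @Table(name = "EXPERIMENT_SUMMARY")
    public class ExperimentSummary {

        @Id
        private String experimentId;

        private String userName;
        private String projectId;
        private String applicationId;
        private String name;
        private String description;

        // The only frequently varying columns: the current state and when it
        // last changed.
        private String state;
        private Timestamp stateUpdateTime;

        // Getters and setters omitted for brevity.
    }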
--
Thanks,
Sachith Withana