Here's how I've written the script to do it.

Experiments loaded: 10 users, 4 projects per user, and 1,000 to 100,000 experiments per user (runs at 1,000, 10,000, and 100,000), covering applications such as Echo and Amber.
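Roughly, the loader looks like the sketch below. This is a sketch only: the actual registry client call is left as a placeholder comment, since I'm not reproducing the exact Airavata client method names and signatures here. The naming scheme follows Marlon's prototype-user suggestion quoted below.

    import java.util.Arrays;
    import java.util.List;

    // Sketch of the data loader. The registry client call near the bottom is a
    // placeholder; printing the generated rows instead keeps the sketch
    // self-contained and runnable.
    public class RegistryLoader {

        private static final List<String> APPS = Arrays.asList("Echo", "Amber");

        public static void main(String[] args) {
            // 1000, 10000, or 100000 experiments per user.
            int experimentsPerUser = args.length > 0 ? Integer.parseInt(args[0]) : 1000;

            for (int u = 1; u <= 10; u++) {
                String user = "user" + u;
                for (int e = 1; e <= experimentsPerUser; e++) {
                    // Spread experiments evenly across 4 projects per user.
                    String project = user + "_project" + (e % 4 + 1);
                    String app = APPS.get(e % APPS.size());
                    // Unique, human-readable, searchable name and description.
                    String name = app + " job " + e + " for " + user;
                    String description = "Description of " + name;
                    // airavataClient.createExperiment(...) would go here.
                    System.out.printf("%s | %s | %s | %s | %s%n",
                            user, project, app, name, description);
                }
            }
        }
    }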
Methods tested:

getExperiment()
searchExperimentByName
searchExperimentByApplication
searchExperimentByDescription

WDYT?
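For reference, the timing side of the script is a small harness along these lines (again a sketch only; each Runnable body would wrap one of the client calls listed above):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the timing harness. The Runnable bodies are left empty here so
    // the sketch compiles on its own; in the real script they wrap the client
    // calls named in the map keys.
    public class RegistryBenchmark {

        static long timeMillis(Runnable op, int iterations) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                op.run();
            }
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) {
            Map<String, Runnable> ops = new LinkedHashMap<>();
            ops.put("getExperiment", () -> { /* client.getExperiment(experimentId) */ });
            ops.put("searchExperimentByName", () -> { /* wrap client call */ });
            ops.put("searchExperimentByApplication", () -> { /* wrap client call */ });
            ops.put("searchExperimentByDescription", () -> { /* wrap client call */ });

            for (Map.Entry<String, Runnable> op : ops.entrySet()) {
                // Repeated calls make the OpenJPA cache warm-up visible only in
                // the first run, so steady-state numbers can be reported separately.
                System.out.printf("%s: %d ms for 100 calls%n",
                        op.getKey(), timeMillis(op.getValue(), 100));
            }
        }
    }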
On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <[email protected]> wrote:

> You can start with the API search functions that we have now: by name, by
> application, and by description.
>
> Marlon
>
> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>
>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <[email protected]> wrote:
>>
>>> A single user may have O(100) to O(1000) experiments, so 10K is too small
>>> as an upper bound on the registry for many users.
>>
>> +1
>>
>> I agree with Marlon. We have the most basic search methods, but the
>> reality is that we need search criteria like Marlon suggests, and I am
>> sure content-based search will be pretty slow with a large number of
>> experiments. So we would have to use a search platform like Solr to
>> improve the performance.
>>
>> I think you can first do the performance test without content-based
>> search; then we can implement that feature and do the performance
>> analysis. If it's too bad (more likely), we can integrate a search
>> platform to improve the performance.
>>
>> Lahiru
>>
>>> We should really test until things break. A plot implying infinite
>>> scaling (by extrapolation) is not useful. A plot showing OK scaling up
>>> to a certain point before things decay is useful.
>>>
>>> I suggest you write up a set of experiments more carefully, starting
>>> with Lahiru's suggestion. How many users? How many experiments per user?
>>> What kinds of searches? Probably the most common will be "get all my
>>> experiments that match this string", "get all experiments that have
>>> state FAILED", and "get all my experiments from the last 30 days". But
>>> the API may not have the latter two yet.
>>>
>>> So to start, you should specify a prototype user. For example, each user
>>> will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs, etc. Each
>>> user will have a unique but human-readable name (user1, user2, ...).
>>> Each experiment will have a unique, human-readable description ("AMBER
>>> job 1 for user 1", "AMBER job 2 for user 1", ...) that is suitable for
>>> searching.
>>>
>>> Post these details first, and then you can create experiment registries
>>> of any size via scripts. Each experiment is different but suitable for
>>> pattern searching.
>>>
>>> This is 10 minutes' worth of thought while waiting for my tea to brew,
>>> so hopefully this is the right start, but I encourage you not to take
>>> this as fixed instructions.
>>>
>>> Marlon
>>>
>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>
>>>> Hi Sachith,
>>>>
>>>> How did you test this? What database did you use?
>>>>
>>>> I think 1000 experiments is a very low number. I think the most
>>>> important part is when there are a large number of experiments: how
>>>> expensive is the search, and how expensive is a single experiment
>>>> retrieval?
>>>>
>>>> If we support getting a defined number of experiments in the API (I
>>>> think this is the practical scenario: among 10k experiments, get 100),
>>>> we have to test the performance of that too.
>>>>
>>>> Regards,
>>>> Lahiru
>>>>
>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm testing the registry with 10, 1,000, and 10,000 experiments, and
>>>>> I've tested the database performance by executing the
>>>>> getAllExperiments method. I'll post the complete analysis.
>>>>>
>>>>> What are the other methods that I should test with?
>>>>>
>>>>> getExperiment(experiment_id)
>>>>> searchExperiment
>>>>>
>>>>> Any pointers?
>>>>>
>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Sachith. Did you look at scaling also? That is, will the
>>>>>> operations below still be the slowest if the DB is 10x, 100x, 1000x
>>>>>> bigger?
>>>>>>
>>>>>> Marlon
>>>>>>
>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>
>>>>>>> I looked into the database operations, and I've listed the
>>>>>>> operations that take the most time:
>>>>>>>
>>>>>>> 1. Getting the status of an experiment (takes around 10% of the
>>>>>>> overall time spent). It has to go through the hierarchy of the data
>>>>>>> model to get to the actual experiment status (node, tasks, etc.).
>>>>>>>
>>>>>>> 2. Dealing with the application inputs. Strangely, the queries
>>>>>>> regarding the ApplicationInputs take a long time to complete. This
>>>>>>> is part of the new application catalog.
>>>>>>>
>>>>>>> 3. Getting all the experiments (using the * wildcard). This takes
>>>>>>> the maximum amount of time when first queried, but thanks to the
>>>>>>> OpenJPA caching, it flattens out as we keep querying.
>>>>>>>
>>>>>>> To address the first issue, I would suggest having a separate table
>>>>>>> for experiment summaries, where the status (both the state and the
>>>>>>> state-update time) would be the only varying entity, and using that
>>>>>>> to improve the query time for experiment summaries.
>>>>>>>
>>>>>>> It would also help improve the performance of getting all the
>>>>>>> experiments (experiment summaries).
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> ToDos: look into memory consumption (in terms of memory leakage,
>>>>>>> etc.).
>>>>>>>
>>>>>>> Any more suggestions?
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Sachith Withana
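To make the experiment-summary table idea from the quoted thread concrete, here is a minimal sketch of what such an entity might look like with the OpenJPA annotations the registry already uses. The entity, table, and column names are illustrative assumptions, not the current registry schema:

    import java.sql.Timestamp;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Sketch of a denormalized summary row: one row per experiment, updated in
    // place on each status change, so summary and listing queries never have to
    // walk the experiment -> node -> task hierarchy.
    @Entity
    @Table(name = "EXPERIMENT_SUMMARY")
    public class ExperimentSummary {

        @Id
        private String experimentId;

        private String userName;
        private String projectId;
        private String applicationId;
        private String name;
        private String description;

        // The only frequently varying columns: the current state and when it
        // last changed.
        private String state;
        private Timestamp stateUpdateTime;

        // Getters and setters omitted for brevity.
    }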
--
Thanks,
Sachith Withana