Hi Sachith,

I think we should use MySQL, which is our production-recommended database. I think we should do the performance test with the production scenario.
Lahiru

On Thu, Aug 14, 2014 at 7:35 PM, Sachith Withana <[email protected]> wrote:

> The Derby one.
>
>
> On Thu, Aug 14, 2014 at 7:06 PM, Chathuri Wimalasena <[email protected]> wrote:
>
>> Hi Sachith,
>>
>> Which DB are you using to do the profiling?
>>
>>
>> On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <[email protected]> wrote:
>>
>>> Here's how I've written the script to do it.
>>>
>>> Experiments loaded:
>>> 10 users, 4 projects per user,
>>> each user would have 1,000 to 100,000 experiments (1,000 / 10,000 / 100,000),
>>> containing experiments like Echo and Amber.
>>>
>>> Methods tested:
>>>
>>> getExperiment()
>>> searchExperimentByName
>>> searchExperimentByApplication
>>> searchExperimentByDescription
>>>
>>> WDYT?
>>>
>>>
>>> On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <[email protected]> wrote:
>>>
>>>> You can start with the API search functions that we have now: by name,
>>>> by application, by description.
>>>>
>>>> Marlon
>>>>
>>>>
>>>> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>>>>
>>>>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <[email protected]> wrote:
>>>>>
>>>>>> A single user may have O(100) to O(1000) experiments, so 10K is too
>>>>>> small as an upper bound on the registry for many users.
>>>>>
>>>>> +1
>>>>>
>>>>> I agree with Marlon. We have the most basic search method now, but the
>>>>> reality is we need search criteria like Marlon suggests, and I am sure
>>>>> content-based search will be pretty slow with a large number of
>>>>> experiments. So we would have to use a search platform like Solr to
>>>>> improve the performance.
>>>>>
>>>>> I think you can first do the performance test without content-based
>>>>> search; then we can implement that feature and do the performance
>>>>> analysis again. If it is too slow (which is likely), we can integrate
>>>>> a search platform to improve the performance.
>>>>>
>>>>> Lahiru
>>>>>
>>>>>> We should really test until things break.
>>>>>> A plot implying infinite scaling (by extrapolation) is not useful.
>>>>>> A plot showing OK scaling up to a certain point before things decay
>>>>>> is useful.
>>>>>>
>>>>>> I suggest you post more carefully a set of experiments, starting with
>>>>>> Lahiru's suggestion. How many users? How many experiments per user?
>>>>>> What kind of searches? Probably the most common will be "get all my
>>>>>> experiments that match this string", "get all experiments that have
>>>>>> state FAILED", and "get all my experiments from the last 30 days".
>>>>>> But the API may not have the latter two yet.
>>>>>>
>>>>>> So to start, you should specify a prototype user. For example, each
>>>>>> user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs,
>>>>>> etc. Each user will have a unique but human-readable name (user1,
>>>>>> user2, ...). Each experiment will have a unique, human-readable
>>>>>> description (AMBER job 1 for user 1, AMBER job 2 for user 1, ...),
>>>>>> etc., that is suitable for searching.
>>>>>>
>>>>>> Post these details first, and then you can create experiment
>>>>>> registries of any size via scripts. Each experiment is different but
>>>>>> suitable for pattern searching.
>>>>>>
>>>>>> This is 10 minutes' worth of thought while waiting for my tea to
>>>>>> brew, so hopefully this is the right start, but I encourage you to
>>>>>> not take this as fixed instructions.
>>>>>>
>>>>>> Marlon
>>>>>>
>>>>>>
>>>>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>>>>
>>>>>>> Hi Sachith,
>>>>>>>
>>>>>>> How did you test this? What database did you use?
>>>>>>>
>>>>>>> I think 1000 experiments is a very low number. I think the most
>>>>>>> important part is, when there are a large number of experiments,
>>>>>>> how expensive is the search and how expensive is a single
>>>>>>> experiment retrieval.
>>>>>>>
>>>>>>> If we support getting a defined number of experiments in the API
>>>>>>> (I think this is the practical scenario: among 10k experiments,
>>>>>>> get 100), we have to test the performance of that too.
>>>>>>>
>>>>>>> Regards
>>>>>>> Lahiru
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm testing the registry with 10, 1,000, and 10,000 experiments,
>>>>>>>> and I've tested the database performance executing the
>>>>>>>> getAllExperiments method. I'll post the complete analysis.
>>>>>>>>
>>>>>>>> What are the other methods that I should test?
>>>>>>>>
>>>>>>>> getExperiment(experiment_id)
>>>>>>>> searchExperiment
>>>>>>>>
>>>>>>>> Any pointers?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Sachith. Did you look at scaling also? That is, will the
>>>>>>>>> operations below still be the slowest if the DB is 10x, 100x, or
>>>>>>>>> 1000x bigger?
>>>>>>>>>
>>>>>>>>> Marlon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>>>>
>>>>>>>>>> I looked into the database operations and I've listed the
>>>>>>>>>> operations that take the most time.
>>>>>>>>>>
>>>>>>>>>> 1. Getting the status of an experiment (takes around 10% of the
>>>>>>>>>> overall time spent)
>>>>>>>>>>     Has to go through the hierarchy of the data model to get to
>>>>>>>>>> the actual experiment status (node, tasks, etc.)
>>>>>>>>>>
>>>>>>>>>> 2. Dealing with the application inputs
>>>>>>>>>>     Strangely, it takes a long time for the queries regarding
>>>>>>>>>> the ApplicationInputs to complete.
>>>>>>>>>> This is a part of the new Application Catalog.
>>>>>>>>>>
>>>>>>>>>> 3. Getting all the experiments (using the * wild card)
>>>>>>>>>>     This takes the maximum amount of time when queried at first,
>>>>>>>>>> but thanks to the OpenJPA caching, it flattens out as we keep
>>>>>>>>>> querying.
>>>>>>>>>>
>>>>>>>>>> To reduce the first issue, I would suggest having a separate
>>>>>>>>>> table for experiment summaries, where the status (both the state
>>>>>>>>>> and the state update time) would be the only varying entity, and
>>>>>>>>>> using that to improve the query time for experiment summaries.
>>>>>>>>>>
>>>>>>>>>> It would also help improve the performance of getting all the
>>>>>>>>>> experiments (experiment summaries).
>>>>>>>>>>
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>>> ToDos: Look into memory consumption (in terms of memory leakage,
>>>>>>>>>> etc.)
>>>>>>>>>>
>>>>>>>>>> Any more suggestions?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> Sachith Withana
>>>
>>> --
>>> Thanks,
>>> Sachith Withana
>
> --
> Thanks,
> Sachith Withana

--
System Analyst Programmer
PTI Lab
Indiana University
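[Editor's note: the data-generation script discussed in this thread was not attached. Below is a minimal sketch of what such a generator could look like, following Marlon's prototype-user description (unique human-readable names and searchable descriptions). All field names, application names, and counts here are illustrative assumptions, not Airavata's actual data model or API.]

```python
import random

# Hypothetical application mix; the thread mentions Echo, AMBER, and LAMMPS.
APPLICATIONS = ["ECHO", "AMBER", "LAMMPS"]

def generate_experiments(num_users=10, projects_per_user=4,
                         experiments_per_user=1000):
    """Yield flat experiment records with unique, human-readable,
    searchable names and descriptions (user1, user2, ...;
    'AMBER job 1 for user1', ...). The real registry model is
    hierarchical (experiment -> nodes -> tasks); this is a stand-in."""
    for u in range(1, num_users + 1):
        user = f"user{u}"
        for e in range(1, experiments_per_user + 1):
            app = APPLICATIONS[e % len(APPLICATIONS)]
            yield {
                "experiment_id": f"{user}-exp{e}",
                "user": user,
                "project": f"{user}-project{(e % projects_per_user) + 1}",
                "application": app,
                "name": f"{app} job {e} for {user}",
                "description": f"{app} job {e} for {user}",
                "state": random.choice(["COMPLETED", "FAILED", "LAUNCHED"]),
            }

# Small smoke test; scale num_users / experiments_per_user for real runs.
records = list(generate_experiments(num_users=2, experiments_per_user=5))
print(len(records))        # 10
print(records[0]["name"])  # AMBER job 1 for user1
```

Loading the generated records into Derby or MySQL and timing the search methods against them is then a separate step, which the thread leaves to the actual script.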
