+1. We should use mysql.
On Thu, Aug 14, 2014 at 10:10 AM, Marlon Pierce <[email protected]> wrote:
> +1 for mysql.
>
> On 8/14/14, 10:09 AM, Lahiru Gunathilake wrote:
>> Hi Sachith,
>>
>> I think we should use mysql, which is our production-recommended
>> database. I think we should do the performance test with the production
>> scenario.
>>
>> Lahiru
>>
>> On Thu, Aug 14, 2014 at 7:35 PM, Sachith Withana <[email protected]> wrote:
>>> The Derby one.
>>>
>>> On Thu, Aug 14, 2014 at 7:06 PM, Chathuri Wimalasena <[email protected]> wrote:
>>>> Hi Sachith,
>>>>
>>>> Which DB are you using to do the profiling?
>>>>
>>>> On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <[email protected]> wrote:
>>>>> Here's how I've written the script to do it.
>>>>>
>>>>> Experiments loaded:
>>>>> 10 users, 4 projects per user; each user would have 1,000 to 100,000
>>>>> experiments (1,000 / 10,000 / 100,000), containing experiments like
>>>>> Echo and Amber.
>>>>>
>>>>> Methods tested:
>>>>> getExperiment()
>>>>> searchExperimentByName
>>>>> searchExperimentByApplication
>>>>> searchExperimentByDescription
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <[email protected]> wrote:
>>>>>> You can start with the API search functions that we have now: by
>>>>>> name, by application, by description.
>>>>>>
>>>>>> Marlon
>>>>>>
>>>>>> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>>>>>>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <[email protected]> wrote:
>>>>>>>> A single user may have O(100) to O(1000) experiments, so 10K is
>>>>>>>> too small as an upper bound on the registry for many users.
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> I agree with Marlon. We have the most basic search method, but the
>>>>>>> reality is we need search criteria like Marlon suggests, and I am
>>>>>>> sure content-based search will be pretty slow with a large number
>>>>>>> of experiments.
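[Editor's note: the data-loading script Sachith describes above is not shown in the thread. A minimal sketch of that kind of generator, assuming the stated shape (10 users, 4 projects per user, a configurable experiment count, human-readable searchable names and descriptions); all function and field names here are illustrative, not Airavata API calls.]

```python
# Hypothetical sketch of a synthetic-registry generator matching the thread:
# round-robins projects and applications so every searchable field is populated.
import itertools

APPLICATIONS = ["Echo", "Amber"]  # applications named in the thread

def generate_experiments(n_users=10, projects_per_user=4, experiments_per_user=1000):
    """Yield one dict per synthetic experiment with unique, human-readable
    names and descriptions suitable for pattern searching."""
    for u in range(1, n_users + 1):
        apps = itertools.cycle(APPLICATIONS)
        for e in range(1, experiments_per_user + 1):
            project = (e - 1) % projects_per_user + 1
            app = next(apps)
            yield {
                "user": f"user{u}",
                "project": f"user{u}-project{project}",
                "name": f"{app}-job-{e}-user{u}",
                "application": app,
                "description": f"{app} job {e} for user {u}",
            }

# Small run for illustration; scale experiments_per_user up to 100,000 for the tests.
experiments = list(generate_experiments(n_users=2, experiments_per_user=10))
```

Scaling the counts up gives the 1,000 / 10,000 / 100,000 registry sizes discussed above without changing the generator.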
>>>>>>> So we have to use a search platform like Solr to improve the
>>>>>>> performance.
>>>>>>>
>>>>>>> I think first you can do the performance test without content-based
>>>>>>> search; then we can implement that feature and do the performance
>>>>>>> analysis again. If it's too bad (more likely), we can integrate a
>>>>>>> search platform to improve the performance.
>>>>>>>
>>>>>>> Lahiru
>>>>>>>
>>>>>>>> We should really test until things break. A plot implying infinite
>>>>>>>> scaling (by extrapolation) is not useful. A plot showing OK scaling
>>>>>>>> up to a certain point before things decay is useful.
>>>>>>>>
>>>>>>>> I suggest you post more carefully a set of experiments, starting
>>>>>>>> with Lahiru's suggestion. How many users? How many experiments per
>>>>>>>> user? What kind of searches? Probably the most common will be "get
>>>>>>>> all my experiments that match this string", "get all experiments
>>>>>>>> that have state FAILED", and "get all my experiments from the last
>>>>>>>> 30 days". But the API may not have the latter two yet.
>>>>>>>>
>>>>>>>> So to start, you should specify a prototype user. For example, each
>>>>>>>> user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs,
>>>>>>>> etc. Each user will have a unique but human-readable name (user1,
>>>>>>>> user2, ...). Each experiment will have a unique, human-readable
>>>>>>>> description ("AMBER job 1 for user 1", "AMBER job 2 for user 1",
>>>>>>>> ...) that is suitable for searching.
>>>>>>>>
>>>>>>>> Post these details first, and then you can create experiment
>>>>>>>> registries of any size via scripts. Each experiment is different
>>>>>>>> but suitable for pattern searching.
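[Editor's note: the "test until things break" idea above can be sketched as timing the same search at several registry sizes and recording latency per size, rather than extrapolating from one small run. This sketch uses an in-memory SQLite database purely as a stand-in for the real MySQL/Derby registry; the schema and pattern are made up.]

```python
# Time an unindexed LIKE search at increasing registry sizes and collect
# (latency, match count) per size; plotting these points shows where scaling decays.
import sqlite3
import time

def time_search(n_experiments, pattern="%job 7%"):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, description TEXT)")
    conn.executemany(
        "INSERT INTO experiment (description) VALUES (?)",
        ((f"AMBER job {i} for user {i % 10}",) for i in range(n_experiments)),
    )
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT id FROM experiment WHERE description LIKE ?", (pattern,)
    ).fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed, len(rows)

# The registry sizes discussed in the thread.
results = {n: time_search(n) for n in (1_000, 10_000, 100_000)}
```

A full-table LIKE scan is roughly linear in registry size, which is exactly the behavior a Solr-style index would be brought in to avoid.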
>>>>>>>> This is 10 minutes' worth of thought while waiting for my tea to
>>>>>>>> brew, so hopefully this is the right start, but I encourage you not
>>>>>>>> to take this as fixed instructions.
>>>>>>>>
>>>>>>>> Marlon
>>>>>>>>
>>>>>>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>>>>>>> Hi Sachith,
>>>>>>>>>
>>>>>>>>> How did you test this? What database did you use?
>>>>>>>>>
>>>>>>>>> I think 1000 experiments is a very low number. I think the most
>>>>>>>>> important part is, when there are a large number of experiments,
>>>>>>>>> how expensive the search is and how expensive a single experiment
>>>>>>>>> retrieval is.
>>>>>>>>>
>>>>>>>>> If we support getting a defined number of experiments in the API
>>>>>>>>> (I think this is the practical scenario: among 10k experiments,
>>>>>>>>> get 100), we have to test the performance of that too.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Lahiru
>>>>>>>>>
>>>>>>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <[email protected]> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'm testing the registry with 10 / 1,000 / 10,000 experiments,
>>>>>>>>>> and I've tested the database performance executing the
>>>>>>>>>> getAllExperiments method. I'll post the complete analysis.
>>>>>>>>>>
>>>>>>>>>> What are the other methods that I should test with?
>>>>>>>>>>
>>>>>>>>>> getExperiment(experiment_id)
>>>>>>>>>> searchExperiment
>>>>>>>>>>
>>>>>>>>>> Any pointers?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <[email protected]> wrote:
>>>>>>>>>>> Thanks, Sachith. Did you look at scaling also? That is, will the
>>>>>>>>>>> operations below still be the slowest if the DB is 10x, 100x,
>>>>>>>>>>> 1000x bigger?
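[Editor's note: Lahiru's point above about fetching a defined number of experiments ("among 10k, get 100") is ordinary LIMIT/OFFSET paging. A sketch of what that benchmark query would look like, with SQLite standing in for the registry database; the table and column names are made up.]

```python
# Page through one user's experiments instead of fetching all of them,
# which is the access pattern Lahiru suggests benchmarking.
import sqlite3

def fetch_page(conn, user, limit=100, offset=0):
    """Return at most `limit` experiment ids for `user`, newest first."""
    return [row[0] for row in conn.execute(
        "SELECT id FROM experiment WHERE user = ? "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (user, limit, offset),
    )]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, user TEXT)")
conn.executemany("INSERT INTO experiment (user) VALUES (?)",
                 (("user1",) for _ in range(10_000)))

first_page = fetch_page(conn, "user1", limit=100)
second_page = fetch_page(conn, "user1", limit=100, offset=100)
```

With an index on `(user, id)` this stays cheap even at large registry sizes, which is the measurement worth adding to the test matrix.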
>>>>>>>>>>> Marlon
>>>>>>>>>>>
>>>>>>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>>>>>>
>>>>>>>>>>>> I looked into the database operations, and I've listed the
>>>>>>>>>>>> operations that take the most time:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Getting the status of an experiment (takes around 10% of the
>>>>>>>>>>>> overall time spent). It has to go through the hierarchy of the
>>>>>>>>>>>> data model (node, tasks, etc.) to get to the actual experiment
>>>>>>>>>>>> status.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Dealing with the application inputs. Strangely, it takes a
>>>>>>>>>>>> long time for the queries regarding the ApplicationInputs to
>>>>>>>>>>>> complete. This is a part of the new Application Catalog.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. Getting all the experiments (using the * wildcard). This
>>>>>>>>>>>> takes the most time when first queried, but thanks to the
>>>>>>>>>>>> OpenJPA caching, it flattens out as we keep querying.
>>>>>>>>>>>>
>>>>>>>>>>>> To reduce the first issue, I would suggest having a separate
>>>>>>>>>>>> table for experiment summaries, where the status (both the
>>>>>>>>>>>> state and the state-update time) would be the only varying
>>>>>>>>>>>> entity, and using that to improve the query time for experiment
>>>>>>>>>>>> summaries.
>>>>>>>>>>>>
>>>>>>>>>>>> It would also help improve the performance of getting all the
>>>>>>>>>>>> experiments (experiment summaries).
>>>>>>>>>>>>
>>>>>>>>>>>> WDYT?
>>>>>>>>>>>> To-dos: look into memory consumption (memory leakage, etc.).
>>>>>>>>>>>>
>>>>>>>>>>>> Any more suggestions?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sachith Withana
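[Editor's note: the experiment-summary table Sachith proposes in the thread is a denormalization that keeps only the frequently changing fields (state and state-update time) in one narrow table, so status reads avoid walking the node/task hierarchy. A sketch of that idea, assuming an illustrative schema rather than the actual Airavata registry schema.]

```python
# Denormalized summary table: one row per experiment, upserted whenever any
# task/node reports a state change, so a status read is a single-row lookup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment (
        id   TEXT PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE experiment_summary (
        experiment_id TEXT PRIMARY KEY REFERENCES experiment(id),
        state         TEXT,
        update_time   INTEGER
    );
""")

def update_status(conn, experiment_id, state, update_time):
    # Upsert keeps the summary row current; requires SQLite >= 3.24.
    conn.execute(
        "INSERT INTO experiment_summary (experiment_id, state, update_time) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT(experiment_id) DO UPDATE SET state = excluded.state, "
        "update_time = excluded.update_time",
        (experiment_id, state, update_time),
    )

conn.execute("INSERT INTO experiment VALUES ('exp1', 'AMBER job 1')")
update_status(conn, "exp1", "LAUNCHED", 1)
update_status(conn, "exp1", "COMPLETED", 2)
state = conn.execute(
    "SELECT state FROM experiment_summary WHERE experiment_id = 'exp1'"
).fetchone()[0]
```

The same table also serves the "get all experiment summaries" case in the thread: a scan of this narrow table replaces a join across the full data-model hierarchy.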
