+1. We should use MySQL.

On Thu, Aug 14, 2014 at 10:10 AM, Marlon Pierce <[email protected]> wrote:

> +1 for MySQL.
>
>
> On 8/14/14, 10:09 AM, Lahiru Gunathilake wrote:
>
>> Hi Sachith,
>>
>> I think we should use MySQL, which is our production-recommended
>> database. I think we should run the performance test with the
>> production scenario.
>>
>> Lahiru
>>
>>
>> On Thu, Aug 14, 2014 at 7:35 PM, Sachith Withana <[email protected]>
>> wrote:
>>
>>> The Derby one.
>>>
>>>
>>> On Thu, Aug 14, 2014 at 7:06 PM, Chathuri Wimalasena <
>>> [email protected]> wrote:
>>>
>>>> Hi Sachith,
>>>>
>>>> Which DB are you using to do the profiling?
>>>>
>>>>
>>>> On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <[email protected]>
>>>> wrote:
>>>>
>>>>> Here's how I've written the script to do it.
>>>>>
>>>>> Experiments loaded:
>>>>> 10 users, 4 projects per user,
>>>>> each user would have 1,000 to 100,000 experiments (1,000 / 10,000 / 100,000),
>>>>> containing experiment types such as Echo and Amber.
>>>>>
>>>>> Methods tested:
>>>>>
>>>>> getExperiment()
>>>>> searchExperimentByName
>>>>> searchExperimentByApplication
>>>>> searchExperimentByDescription
>>>>>
>>>>> WDYT?
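A minimal sketch of such a loading script, assuming a flattened stand-in table (the column names and the sqlite backing are placeholders for illustration, not the actual registry schema, which lives behind OpenJPA entities):

```python
import sqlite3

# Hypothetical, simplified stand-in for the registry; the real registry
# is populated through the Airavata API, not raw SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE experiment (
    experiment_id TEXT PRIMARY KEY,
    user_name     TEXT,
    project_id    TEXT,
    application   TEXT,
    description   TEXT)""")

APPLICATIONS = ["Echo", "Amber"]

def load_experiments(num_users=10, projects_per_user=4, experiments_per_user=1000):
    """Populate the table per the proposed plan: 10 users, 4 projects
    each, N experiments spread evenly across projects and applications."""
    rows = []
    for u in range(1, num_users + 1):
        for e in range(1, experiments_per_user + 1):
            project = (e - 1) % projects_per_user + 1
            app = APPLICATIONS[e % len(APPLICATIONS)]
            rows.append((
                f"user{u}_exp{e}",              # unique id
                f"user{u}",                     # human-readable user name
                f"user{u}_project{project}",
                app,
                f"{app} job {e} for user{u}",   # searchable description
            ))
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", rows)

load_experiments()
```

Rerunning with `experiments_per_user=10_000` or `100_000` gives the other two registry sizes in the plan.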
>>>>>
>>>>>
>>>>> On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> You can start with the API search functions that we have now: by
>>>>>> name, by application, by description.
>>>>>>
>>>>>> Marlon
>>>>>>
>>>>>>
>>>>>> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>>>>>>
>>>>>>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> A single user may have O(100) to O(1000) experiments, so 10K is too
>>>>>>>> small as an upper bound on the registry for many users.
>>>>>>>>  +1
>>>>>>>
>>>>>>> I agree with Marlon, we have the most basic search method, but the
>>>>>>> reality
>>>>>>> is we need search criteria like Marlon suggest, and I am sure content
>>>>>>> based
>>>>>>> search will be pretty slow with large number of experiments. So we
>>>>>>> have to
>>>>>>> use a search platform like Solr to improve the performance.
>>>>>>>
>>>>>>> I think first you can do the performance test without content based
>>>>>>> search
>>>>>>> then we can implement that feature, then do performance analysis, if
>>>>>>> its
>>>>>>> too bad(more likely) then we can integrate a search platform to
>>>>>>> improve the
>>>>>>> performance.
>>>>>>>
>>>>>>> Lahiru
>>>>>>>
>>>>>>>> We should really test until things break. A plot implying infinite
>>>>>>>> scaling (by extrapolation) is not useful. A plot showing OK scaling up
>>>>>>>> to a certain point before things decay is useful.
>>>>>>>>
>>>>>>>> I suggest you post a more carefully specified set of experiments,
>>>>>>>> starting with Lahiru's suggestion. How many users? How many
>>>>>>>> experiments per user? What kinds of searches? Probably the most
>>>>>>>> common will be "get all my experiments that match this string", "get
>>>>>>>> all experiments that have state FAILED", and "get all my experiments
>>>>>>>> from the last 30 days". But the API may not have the latter two yet.
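For concreteness, the three searches above could be sketched as parameterized queries against a hypothetical flattened table (the production code would go through the API / JPA queries rather than raw SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE experiment (
    experiment_id TEXT, user_name TEXT, description TEXT,
    state TEXT, created TEXT)""")
conn.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", [
    ("e1", "user1", "AMBER job 1",  "FAILED",    "2014-08-01"),
    ("e2", "user1", "LAMMPS job 2", "COMPLETED", "2014-06-01"),
])

# "get all my experiments that match this string"
by_text = conn.execute(
    "SELECT experiment_id FROM experiment "
    "WHERE user_name = ? AND description LIKE ?",
    ("user1", "%AMBER%")).fetchall()

# "get all experiments that have state FAILED"
failed = conn.execute(
    "SELECT experiment_id FROM experiment WHERE state = 'FAILED'").fetchall()

# "get all my experiments from the last 30 days"
# (pinned to a fixed 'today' so the example is deterministic)
recent = conn.execute(
    "SELECT experiment_id FROM experiment "
    "WHERE user_name = ? AND created >= date(?, '-30 days')",
    ("user1", "2014-08-14")).fetchall()
```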
>>>>>>>>
>>>>>>>> So to start, you should specify a prototype user. For example, each
>>>>>>>> user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs,
>>>>>>>> etc. Each user will have a unique but human-readable name (user1,
>>>>>>>> user2, ...). Each experiment will have a unique, human-readable
>>>>>>>> description (AMBER job 1 for user 1, AMBER job 2 for user 1, ...)
>>>>>>>> that is suitable for searching.
>>>>>>>>
>>>>>>>> Post these details first, and then you can create experiment
>>>>>>>> registries of any size via scripts. Each experiment is different but
>>>>>>>> suitable for pattern searching.
>>>>>>>>
>>>>>>>> This is 10 minutes' worth of thought while waiting for my tea to
>>>>>>>> brew, so hopefully this is the right start, but I encourage you not
>>>>>>>> to take this as fixed instructions.
>>>>>>>>
>>>>>>>> Marlon
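One way to act on "test until things break" is a harness that re-times the same search at registry sizes that grow 10x per step, so the knee in the curve becomes visible. A sketch with a stand-in table and a single LIKE-based search (the real test would call the registry API methods instead):

```python
import sqlite3
import time

def time_query(conn, sql, params, repeats=5):
    """Mean wall-clock time of one query over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / repeats

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment "
             "(experiment_id TEXT, user_name TEXT, description TEXT)")

results = {}
for size in (1_000, 10_000, 100_000):
    # Rebuild the table at each target size so the runs are comparable.
    conn.execute("DELETE FROM experiment")
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?)",
        ((f"exp{i}", f"user{i % 10}", f"AMBER job {i}") for i in range(size)),
    )
    # "get all my experiments that match this string"
    results[size] = time_query(
        conn,
        "SELECT * FROM experiment WHERE user_name = ? AND description LIKE ?",
        ("user1", "%job 4%"),
    )

for size, seconds in sorted(results.items()):
    print(f"{size:>7} rows: {seconds * 1000:.3f} ms")
```

Plotting size against latency (rather than extrapolating from one size) is what shows where the decay starts.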
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>>>>>>
>>>>>>>>> Hi Sachith,
>>>>>>>>>
>>>>>>>>> How did you test this? What database did you use?
>>>>>>>>>
>>>>>>>>> I think 1000 experiments is a very low number. The most important
>>>>>>>>> part is, when there is a large number of experiments, how expensive
>>>>>>>>> the search is and how expensive a single experiment retrieval is.
>>>>>>>>>
>>>>>>>>> If the API supports getting a defined number of experiments (I
>>>>>>>>> think this is the practical scenario: among 10k experiments, get
>>>>>>>>> 100), we have to test the performance of that too.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Lahiru
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'm testing the registry with 10, 1,000, and 10,000 experiments,
>>>>>>>>>> and I've tested the database performance by executing the
>>>>>>>>>> getAllExperiments method. I'll post the complete analysis.
>>>>>>>>>>
>>>>>>>>>> What other methods should I test with?
>>>>>>>>>>
>>>>>>>>>> getExperiment(experiment_id)
>>>>>>>>>> searchExperiment
>>>>>>>>>>
>>>>>>>>>> Any pointers?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Sachith. Did you look at scaling also? That is, will the
>>>>>>>>>>> operations below still be the slowest if the DB is 10x, 100x,
>>>>>>>>>>> 1000x bigger?
>>>>>>>>>>>
>>>>>>>>>>> Marlon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>>>>>>
>>>>>>>>>>>> I looked into the database operations and I've listed the
>>>>>>>>>>>> operations that take the most time.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Getting the status of an experiment (takes around 10% of the
>>>>>>>>>>>> overall time spent)
>>>>>>>>>>>>         Has to go through the hierarchy of the data model (nodes,
>>>>>>>>>>>> tasks, etc.) to get to the actual experiment status.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Dealing with the application inputs
>>>>>>>>>>>>         Strangely, the queries regarding the ApplicationInputs
>>>>>>>>>>>> take a long time to complete. This is part of the new Application
>>>>>>>>>>>> Catalog.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. Getting all the experiments (using the * wildcard)
>>>>>>>>>>>>         This takes the most time when first queried, but thanks
>>>>>>>>>>>> to the OpenJPA caching, it flattens out as we keep querying.
>>>>>>>>>>>>
>>>>>>>>>>>> To reduce the first issue, I would suggest having a separate
>>>>>>>>>>>> table for experiment summaries, where the status (both the state
>>>>>>>>>>>> and the state-update time) would be the only varying entity, and
>>>>>>>>>>>> using that to improve the query time for experiment summaries.
>>>>>>>>>>>>
>>>>>>>>>>>> It would also help improve the performance of getting all the
>>>>>>>>>>>> experiments (experiment summaries).
>>>>>>>>>>>>
>>>>>>>>>>>> WDYT?
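A rough illustration of the summary-table idea (the table and column names are invented for the sketch; the real change would be new OpenJPA entities): keep one row per experiment whose status columns are updated in place, so a summary read becomes a single-table lookup instead of a walk through the experiment/node/task hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized summary table: one row per experiment, where only the
# status columns change as the experiment progresses.
conn.execute("""CREATE TABLE experiment_summary (
    experiment_id TEXT PRIMARY KEY,
    user_name     TEXT,
    name          TEXT,
    state         TEXT,
    state_updated TEXT)""")

def update_status(experiment_id, state, when):
    """Invoked whenever a node/task status change bubbles up."""
    conn.execute(
        "UPDATE experiment_summary SET state = ?, state_updated = ? "
        "WHERE experiment_id = ?",
        (state, when, experiment_id))

conn.execute("INSERT INTO experiment_summary VALUES "
             "('exp1', 'user1', 'AMBER run', 'LAUNCHED', '2014-08-14T10:00')")
update_status("exp1", "COMPLETED", "2014-08-14T11:30")

# Reading the summary is now a flat lookup, with no joins against the
# node/task tables.
state, updated = conn.execute(
    "SELECT state, state_updated FROM experiment_summary "
    "WHERE experiment_id = 'exp1'").fetchone()
```

The trade-off is a second write per status change (the hierarchy and the summary row), in exchange for cheap reads of experiment lists.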
>>>>>>>>>>>>
>>>>>>>>>>>> ToDos: look into memory consumption (e.g., memory leaks).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Any more suggestions?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sachith Withana
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>> --
>>>>> Thanks,
>>>>>   Sachith Withana
>>>>>
>>>>>
>>>>>
>>> --
>>> Thanks,
>>> Sachith Withana
>>>
>>>
>>>
>>
>
