+1 for MySQL.

On 8/14/14, 10:09 AM, Lahiru Gunathilake wrote:
Hi Sachith,

I think we should use MySQL, which is our recommended production database, and we should do the performance test against the production scenario.

Lahiru


On Thu, Aug 14, 2014 at 7:35 PM, Sachith Withana <[email protected]>
wrote:

The Derby one.


On Thu, Aug 14, 2014 at 7:06 PM, Chathuri Wimalasena <[email protected]> wrote:
Hi Sachith,

Which DB are you using to do the profiling?


On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <[email protected]>
wrote:

Here's how I've written the script to do it.

Experiments loaded:
10 users, 4 projects per user;
each user has 1,000 to 100,000 experiments (1,000, 10,000, 100,000),
including experiments such as Echo and Amber.

Methods tested:

getExperiment()
searchExperimentByName
searchExperimentByApplication
searchExperimentByDescription
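
The harness around these is roughly as follows (a minimal sketch; the Registry interface and its method signatures here are placeholders for illustration, not the actual registry client API):

import java.util.List;

// Stand-in for the actual registry client; the real API may differ.
interface Registry {
    Object getExperiment(String experimentId);
    List<?> searchExperimentByName(String user, String name);
    List<?> searchExperimentByApplication(String user, String app);
    List<?> searchExperimentByDescription(String user, String text);
}

public class RegistryBenchmark {

    // Time a single call and return the elapsed milliseconds.
    static long timeMs(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Exercise the search methods under test for every generated user.
    static void run(Registry registry, int userCount) {
        for (int u = 1; u <= userCount; u++) {
            String user = "user" + u;
            System.out.printf("%s: byName=%dms byApp=%dms byDesc=%dms%n",
                    user,
                    timeMs(() -> registry.searchExperimentByName(user, "echo")),
                    timeMs(() -> registry.searchExperimentByApplication(user, "Amber")),
                    timeMs(() -> registry.searchExperimentByDescription(user, "job")));
        }
    }
}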

WDYT?


On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <[email protected]> wrote:

You can start with the API search functions that we have now: by name,
by application, by description.

Marlon


On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:

On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <[email protected]>
wrote:

A single user may have O(100) to O(1000) experiments, so 10K is too small as an upper bound on the registry for many users.

+1

I agree with Marlon. We only have the most basic search method, but in reality we need search criteria like those Marlon suggests, and I am sure content-based search will be pretty slow with a large number of experiments. So we would have to use a search platform like Solr to improve the performance.

I think you can first do the performance test without content-based search; then we can implement that feature and do the performance analysis again. If it's too slow (more likely than not), we can integrate a search platform to improve the performance, along the lines of the sketch below.
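
If we go that route, the integration could be as simple as indexing the searchable fields into a Solr core and querying that instead of the database. A rough sketch with SolrJ (the core name "experiments" and the field names are assumptions for illustration, not an existing schema):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ExperimentSearchIndex {

    // Core name and field names are assumptions for this sketch.
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/experiments").build();

    // Index one experiment's searchable fields.
    public void index(String id, String name, String description) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("name", name);
        doc.addField("description", description);
        solr.add(doc);
        solr.commit();
    }

    // Content-based search over descriptions, offloaded from the database.
    public QueryResponse searchByDescription(String text) throws Exception {
        SolrQuery query = new SolrQuery("description:" + text);
        query.setRows(100); // cap the result size
        return solr.query(query);
    }
}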

Lahiru

We should really test until things break. A plot implying infinite scaling (by extrapolation) is not useful. A plot showing OK scaling up to a certain point before things decay is useful.

I suggest you lay out a set of experiments more carefully, starting with Lahiru's suggestion. How many users? How many experiments per user? What kinds of searches? Probably the most common will be "get all my experiments that match this string", "get all experiments that have state FAILED", and "get all my experiments from the last 30 days". But the API may not have the latter two yet.

So to start, you should specify a prototype user. For example, each user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs, etc. Each user will have a unique but human-readable name (user1, user2, ...). Each experiment will have a unique, human-readable description (AMBER job 1 for user 1, AMBER job 2 for user 1, ...) that is suitable for searching.

Post these details first, and then you can create experiment registries of any size via scripts, along the lines of the generator sketched below. Each experiment is different but suitable for pattern searching.
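
Something like this for the generator (a sketch only; the class and field names just show the naming pattern, not a real schema):

import java.util.ArrayList;
import java.util.List;

public class ExperimentGenerator {

    // A minimal record of what gets loaded; field names are illustrative only.
    static class Experiment {
        final String user, application, description;
        Experiment(String user, String application, String description) {
            this.user = user;
            this.application = application;
            this.description = description;
        }
    }

    // Generate the prototype population: userCount users, each with
    // perApplication jobs of every application, all with searchable,
    // human-readable descriptions.
    static List<Experiment> generate(int userCount, int perApplication) {
        String[] applications = {"AMBER", "LAMMPS", "Echo"};
        List<Experiment> experiments = new ArrayList<>();
        for (int u = 1; u <= userCount; u++) {
            String user = "user" + u;
            for (String app : applications) {
                for (int j = 1; j <= perApplication; j++) {
                    experiments.add(new Experiment(user, app,
                            app + " job " + j + " for " + user));
                }
            }
        }
        return experiments;
    }
}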

This is 10 minutes' worth of thought while waiting for my tea to brew, so hopefully this is the right start, but I encourage you not to take this as fixed instructions.

Marlon


On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:

Hi Sachith,

How did you test this? What database did you use?

I think 1,000 experiments is a very low number. The most important part is how expensive the search and a single-experiment retrieval are when there is a large number of experiments.

If we support fetching a defined number of experiments through the API (I think this is the practical scenario: among 10k experiments, get 100), we have to test the performance of that too; see the pagination sketch below.
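
With OpenJPA that would be standard JPA pagination, roughly like this (the Experiment entity here is a placeholder, not the real registry mapping):

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.TypedQuery;

// Placeholder entity; the real registry mapping will differ.
@Entity
class Experiment {
    @Id String id;
    String userName;
}

public class ExperimentPager {

    // Fetch one page of a user's experiments, e.g. 100 out of 10k.
    static List<Experiment> page(EntityManager em, String userName,
                                 int offset, int pageSize) {
        TypedQuery<Experiment> q = em.createQuery(
                "SELECT e FROM Experiment e WHERE e.userName = :user ORDER BY e.id",
                Experiment.class);
        q.setParameter("user", userName);
        q.setFirstResult(offset);  // skip rows already delivered
        q.setMaxResults(pageSize); // cap the page size
        return q.getResultList();
    }
}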

Regards
Lahiru


On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <[email protected]> wrote:

   Hi all,

I'm testing the registry with 10, 1,000, and 10,000 experiments, and I've tested the database performance of the getAllExperiments method. I'll post the complete analysis.

What are the other methods that I should test?

getExperiment(experiment_id)
searchExperiment

Any pointers?



On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <[email protected]>
wrote:

Thanks, Sachith. Did you look at scaling also? That is, will the operations below still be the slowest if the DB is 10x, 100x, 1000x bigger?

Marlon


On 7/23/14, 8:22 AM, Sachith Withana wrote:

   Hi all,

I'm profiling the current registry in a few different aspects.

I looked into the database operations, and I've listed the operations that take the most time.

1. Getting the status of an experiment (takes around 10% of the overall time spent):
        it has to go through the hierarchy of the data model (nodes, tasks, etc.) to get to the actual experiment status.

2. Dealing with the application inputs:
        strangely, the queries regarding the ApplicationInputs take a long time to complete. This is part of the new Application Catalog.

3. Getting all the experiments (using the * wildcard):
        this takes the maximum amount of time when queried at first, but thanks to OpenJPA caching, it flattens out as we keep querying.

To reduce the first issue, I would suggest having a separate table for experiment summaries, where the status (both the state and the state-update time) would be the only varying entity, and using that to improve the query time for experiment summaries; see the sketch below.

It would also help improve the performance of getting all the experiments (experiment summaries).
WDYT?

ToDos: look into memory consumption (memory leaks, etc.).


Any more suggestions?


  --
Thanks,
Sachith Withana





--
Thanks,
Sachith Withana



--
Thanks,
Sachith Withana



