[ https://issues.apache.org/jira/browse/SOLR-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023607#comment-16023607 ]

Shalin Shekhar Mangar commented on SOLR-10317:
----------------------------------------------

bq. I would love to know your thoughts as to how we can come up with a better 
and much more comprehensive suite, based on the prior work done

The current solr-perf-tools has support for:
# Indexing JSON documents on a single-node Solr instance with schemaless configs
# Indexing wiki-1kb docs on a single-node Solr instance with a fixed schema
# Indexing wiki-4kb docs on a single-node Solr instance with a fixed schema
# Indexing wiki-1kb docs on a 2-shard, 1-replica SolrCloud cluster

See the report attached at SOLR-9863. I like wiki data because the Lucene 
benchmarks also use it (though not exactly the same data), which gives us a 
sense of how much overhead Solr adds over Lucene. I also spent more time on 
non-cloud single-node benchmarks because those are easier to reason about and 
debug. Troubleshooting cloud performance problems is much more difficult without 
first establishing a baseline using consistent single-node benchmarks.
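
To make the shape of these benchmarks concrete, here is a minimal sketch of what 
a single-node JSON indexing benchmark could look like. This is not the actual 
solr-perf-tools code; the collection name, batch size, and data file are 
placeholders for illustration.

{code:python}
# Minimal sketch of a single-node JSON indexing benchmark (illustrative only;
# the collection name, batch size, and data file are placeholders, not the
# actual solr-perf-tools setup).
import json
import time
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/benchmark/update"
BATCH_SIZE = 1000

def post_json(payload):
    """POST a JSON payload to Solr's update handler."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

def run_indexing_benchmark(path):
    batch, indexed = [], 0
    start = time.time()
    with open(path) as f:
        for line in f:                       # assume one JSON document per line
            batch.append(json.loads(line))
            if len(batch) == BATCH_SIZE:
                post_json(batch)             # a JSON array of docs is an "add"
                indexed += len(batch)
                batch = []
    if batch:
        post_json(batch)
        indexed += len(batch)
    post_json({"commit": {}})                # hard commit so it counts in the timing
    elapsed = time.time() - start
    print("indexed %d docs in %.1f s (%.1f docs/s)"
          % (indexed, elapsed, indexed / elapsed))

if __name__ == "__main__":
    run_indexing_benchmark("wiki-1kb.json")  # placeholder data file
{code}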

We can go two ways from here:
# Clean up/refactor the code to make the tool easier to extend and to add 
benchmarks to, e.g. instead of writing Python code for each variant of a test, 
perhaps a test description written in JSON or a DSL could be executed (a rough 
sketch of that idea follows this list)
# Forget about cleaning up the code and just add more benchmarks, both indexing 
and query, and not worry about code duplication
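
To illustrate the declarative idea in #1, here is a rough sketch of a JSON test 
description plus a tiny driver that executes it. The description format (the 
name/type/dataFile/variants fields) is invented here purely for illustration; it 
is not an existing solr-perf-tools format.

{code:python}
# Sketch of the declarative-benchmark idea in option #1: a benchmark is
# described as JSON and a small driver executes it. The field names below are
# made up for illustration and are not an existing format.
import json

EXAMPLE_DESCRIPTION = """
{
  "name": "wiki-1kb-indexing",
  "type": "indexing",
  "solrUrl": "http://localhost:8983/solr/benchmark",
  "dataFile": "wiki-1kb.json",
  "variants": [
    {"batchSize": 100},
    {"batchSize": 1000},
    {"batchSize": 10000}
  ]
}
"""

def run_indexing_variant(description, variant):
    # Placeholder: a real runner would reuse the indexing code above,
    # parameterized by variant["batchSize"].
    print("would index %s with %s" % (description["dataFile"], variant))

def run_query_variant(description, variant):
    print("would query %s with %s" % (description["solrUrl"], variant))

def run_benchmark(description):
    """Dispatch a benchmark description to the right runner, once per variant."""
    runners = {
        "indexing": run_indexing_variant,
        "query": run_query_variant,
    }
    for variant in description.get("variants", [{}]):
        runners[description["type"]](description, variant)

if __name__ == "__main__":
    run_benchmark(json.loads(EXAMPLE_DESCRIPTION))
{code}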

There are arguments for both: #1 above will encourage more people to contribute 
benchmarks, while #2 above will make your progress faster.

As for the benchmarks themselves, we already have basic indexing benchmarks 
there, so we need to get started with query benchmarks. There are many 
possibilities here, but we can start with uncached query performance first. For 
this you need to extract terms out of your data set, classify them according to 
frequency, and test all combinations on a Solr instance with the query/filter 
caches disabled, making sure each combination is graphed separately. Reuse the 
indexes built by the indexing tests. As an example, see the BooleanQuery section 
at https://home.apache.org/~mikemccand/lucenebench/ and the extracted terms data 
at 
https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.tasks.
 Then repeat this using both {{q}} and {{fq}} params, then with the dismax query 
parser, and so on. Then iterate again with caches enabled this time. Then repeat 
with re-indexing data during the query tests (both cached and uncached cases). 
Use your imagination. Focus on correctness and repeatability.
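
As a starting point, here is a minimal sketch of the uncached query benchmark 
described above: pull terms via the TermsComponent, bucket them by document 
frequency, and time simple term queries per bucket. The field name, collection 
name, frequency thresholds, and the availability of a /terms handler are 
assumptions, and timing uses Solr's reported QTime rather than wall-clock time.

{code:python}
# Sketch of an uncached query benchmark: extract terms, classify by frequency,
# and time term queries per bucket. The field name ("body"), collection name,
# and bucket thresholds are assumptions for illustration.
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/benchmark"

def fetch(url):
    return json.loads(urllib.request.urlopen(url).read())

def extract_terms(field="body", limit=1000):
    """Fetch the most frequent terms of a field, assuming a /terms handler is enabled."""
    url = "%s/terms?terms.fl=%s&terms.limit=%d&terms.sort=count&wt=json" % (SOLR, field, limit)
    resp = fetch(url)
    pairs = resp["terms"][field]           # flat [term, freq, term, freq, ...] list
    return list(zip(pairs[::2], pairs[1::2]))

def bucket_by_frequency(terms):
    """Classify terms as high/medium/low frequency (thresholds are arbitrary here)."""
    buckets = {"high": [], "medium": [], "low": []}
    for term, freq in terms:
        if freq > 100000:
            buckets["high"].append(term)
        elif freq > 1000:
            buckets["medium"].append(term)
        else:
            buckets["low"].append(term)
    return buckets

def time_query(params):
    """Run one query and return Solr's reported QTime in milliseconds."""
    url = "%s/select?%s" % (SOLR, urllib.parse.urlencode(params))
    return fetch(url)["responseHeader"]["QTime"]

def run(buckets, field="body"):
    # Graph each bucket separately; here we just print averages.
    for name, terms in buckets.items():
        if not terms:
            continue
        # q variant; an fq variant would add {"fq": "%s:%s" % (field, term)}.
        times = [time_query({"q": "%s:%s" % (field, t), "rows": 10, "wt": "json"})
                 for t in terms]
        print("%s-frequency terms: avg QTime %.1f ms over %d queries"
              % (name, sum(times) / len(times), len(times)))

if __name__ == "__main__":
    run(bucket_by_frequency(extract_terms()))
{code}

For the uncached runs, the query/filter caches can be left out of (or zero-sized 
in) solrconfig.xml, and an individual {{fq}} can also skip the filter cache via 
the {{cache=false}} local param.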

> Solr Nightly Benchmarks
> -----------------------
>
>                 Key: SOLR-10317
>                 URL: https://issues.apache.org/jira/browse/SOLR-10317
>             Project: Solr
>          Issue Type: Task
>            Reporter: Ishan Chattopadhyaya
>              Labels: gsoc2017, mentor
>         Attachments: changes-lucene-20160907.json, 
> changes-solr-20160907.json, managed-schema, 
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks.docx, 
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks-FINAL-PROPOSAL.pdf, 
> solrconfig.xml
>
>
> Solr needs nightly benchmarks reporting. Similar Lucene benchmarks can be 
> found here, https://home.apache.org/~mikemccand/lucenebench/.
> Preferably, we need:
> # A suite of benchmarks that build Solr from a commit point, start Solr 
> nodes, both in SolrCloud and standalone mode, and record timing information 
> of various operations like indexing, querying, faceting, grouping, 
> replication etc.
> # It should be possible to run them either as an independent suite or as a 
> Jenkins job, and we should be able to report timings as graphs (Jenkins has 
> some charting plugins).
> # The code should eventually be integrated in the Solr codebase, so that it 
> never goes out of date.
> There is some prior work / discussion:
> # https://github.com/shalinmangar/solr-perf-tools (Shalin)
> # https://github.com/chatman/solr-upgrade-tests/blob/master/BENCHMARKS.md 
> (Ishan/Vivek)
> # SOLR-2646 & SOLR-9863 (Mark Miller)
> # https://home.apache.org/~mikemccand/lucenebench/ (Mike McCandless)
> # https://github.com/lucidworks/solr-scale-tk (Tim Potter)
> There is support for building, starting, indexing/querying and stopping Solr 
> in some of these frameworks above. However, the benchmarks run are very 
> limited. Any of these can be a starting point, or a new framework can be used 
> as well. The motivation is to be able to cover every functionality of Solr 
> with a corresponding benchmark that is run every night.
> Proposing this as a GSoC 2017 project. I'm willing to mentor, and I'm sure 
> [~shalinmangar] and [[email protected]] would help here.


