jenkins-bot has submitted this change and it was merged. Change subject: Update Rel Lab: documentation, better error handling, more configurability ......................................................................
Update Rel Lab: documentation, better error handling, more configurability README.md - first pass at basic documentation relcomp.py - catch and report on search errors - make report sections collapsible (and collapse them) - make number of examples printed configurable from the command line - refactor query string formatting - refactor ascii-ification of metric results relevancyRunner.py - make queries, labHost, config, and searchCommand global [settings] that can be overridden by config under each [test#] cqd.py - remove repeated limit init relevance.ini - add config and docs for setting examples printed per metric - added docs for global vs local settings - moved queries and config under global [settings] Bug: T126646 Change-Id: Ib5ef1717883ddfce1ec8b3cfd6fd2fdf19a86a7f --- A README.md M cqd.py M relcomp.py M relevance.ini M relevancyRunner.py 5 files changed, 270 insertions(+), 42 deletions(-) Approvals: DCausse: Looks good to me, approved jenkins-bot: Verified diff --git a/README.md b/README.md new file mode 100644 index 0000000..ad36500 --- /dev/null +++ b/README.md @@ -0,0 +1,148 @@ +# Relevanc(e|y) Lab<sup>*</sup> + +The primary purpose of the Relevance Lab is to allow us<sup>†</sup> to experiment with proposed modifications to our search process and gauge their effectiveness<sup>‡</sup> and impact<sup>§</sup> before releasing them into production, and even before doing any kind of user acceptance or A/B testing. Also, testing in the relevance lab gives an additional benefit over A/B tests (esp. in the case of very targeted changes): with A/B tests we aren't necessarily able to test the behavior of the *same query* with two different configurations. + +<small> +\* Both *relevance* and *relevancy* are attested. They mean [the same thing](https://en.wiktionary.org/wiki/relevance#Alternative_forms "See Wiktionary"). We want to be inclusive, so either is allowed. Note that *Rel Lab* saves several keystrokes and avoids having to choose. 
+ + † Appropriate values of "us" include the Discovery team, other WMF teams, and potentially the wider community of Wiki users and developers. + + ‡ "Does it do anything good?" + + § "How many searches does it affect?" + </small> + + ## Prerequisites + + * Python: There's nothing too fancy here, and it works with Python 2.7, though a few packages are required: + * The package `jsonpath-rw` is required by the main Rel Lab. + * The package `termcolor` is required by the Cirrus Query Debugger. + * If you don't have one of these packages, you can get it with `pip install <package-name>` (`sudo` may be required to install packages). + * SSH access to the host you intend to connect to. + + ## Invocation + + The main Rel Lab process is `relevancyRunner.py`, which takes a `.ini` config file (see below): + +     relevancyRunner.py -c relevance.ini + + ### Processes + + `relevancyRunner.py` parses the `.ini` file (see below), manages configuration, runs the queries against the Elasticsearch cluster and outputs the results, then delegates diffing the results to the `jsonDiffTool` specified in the `.ini` file, and delegates the final report to the `metricTool` specified in the `.ini` file. It also archives the original queries and configuration (`.ini` and JSON `config` files) with the Rel Lab run output. + + The `jsonDiffTool` is implemented as `jsondiff.py`, "an almost smart enough JSON diff tool". It's actually not that smart: it munges the search results JSON a bit, pretty-prints it, and then uses Python's HtmlDiff to make reasonably pretty output. + + The `metricTool` is implemented as `relcomp.py`, which generates an HTML report comparing two relevance lab query runs. A number of metrics are defined, including the zero results rate and generic top-N diffs (sorted or not). Adding and configuring these metrics can be done in `main`, in the array `myMetrics`. 
Examples of queries that change from one run to the next for each metric are provided, with links into the diffs created by `jsondiff.py`. + + Running the queries is typically the most time-consuming part of the process. If you ask for a very large number of results for each query (≫100), the diff step can be very slow. The report processing is generally very quick. + + ### Configuration + + The Rel Lab is configured by way of an `.ini` file. A sample, `relevance.ini`, is provided. Global settings are provided in `[settings]`, and config for the two test runs is in `[test1]` and `[test2]`. + + Additional command line arguments can be added to `searchCommand` to affect the way the queries are run (such as what wiki to run against, changing the number of results returned, and including detailed scoring information). + + The number of examples printed per metric is configurable on the `metricTool` command line. + + See `relevance.ini` for more details on the command line arguments. + + Each `[test#]` contains the `name` of the query set, and the file containing the `queries` (see Input below). Optionally, a JSON `config` file can be provided, which is passed to `runSearch.php` on the command line. These JSON configurations should be formatted as a single line. + + The settings `queries`, `labHost`, `config`, and `searchCommand` can be specified globally under `[settings]` or per-run under `[test#]`. If both exist, `[test#]` will override `[settings]`. + + #### Example JSON configs: + + * `{"wgCirrusSearchFunctionRescoreWindowSize": 1, "wgCirrusSearchPhraseRescoreWindowSize": 1}` + * Set the Function Rescore Window Size to 1, and set the Phrase Rescore Window Size to 1. + + * `{"wgCirrusSearchAllFields": {"use": false}}` + * Set `$wgCirrusSearchAllFields['use']` to `false`. + + * `{"wgCirrusSearchClusters":{"default": [{"host":"nobelium.eqiad.wmnet", "port":"80"}]}}` + * Forward queries to the Nobelium cluster, which uses non-default port 80. 
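As a sketch of how a JSON `config` reaches the search command: `relevancyRunner.py` reads the file and appends its contents, shell-quoted, as `--options` (the runner uses `pipes.quote` under Python 2; `shlex.quote` below is the Python 3 spelling). The `search_command` value here is a placeholder, not an actual `searchCommand` setting:

```python
import json
import shlex

# Placeholder for the searchCommand setting from the .ini file.
search_command = "php runSearch.php"

# JSON configs must be a single line so they survive shell quoting intact.
options = {"wgCirrusSearchFunctionRescoreWindowSize": 1,
           "wgCirrusSearchPhraseRescoreWindowSize": 1}
single_line = json.dumps(options, separators=(",", ":"))

# The runner shell-quotes the config blob before appending it.
cmdline = search_command + " --options " + shlex.quote(single_line)
print(cmdline)
```

The quoting matters because the config is sent through `ssh` to the lab host; an unquoted JSON blob would be mangled by the shell.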
+ + ## Input + + Queries should be formatted as Unicode text, with one query per line in the file specified under `queries`. Typically, the same queries file would be used by both runs, and the JSON `config` would be the only difference between the runs. + + However, you could have different queries in two different files (e.g., one with quotes and one with the quotes removed). Queries are compared sequentially. That is, the first one in one file is compared to the first one in the other file, etc. + + Query input should not contain tabs. + + + ## Output + + By default, Rel Lab run results are written out to the `relevance/` directory. This can be configured via `workDir` under `[settings]` in the `.ini` file. + + A directory for each query set is created in the `relevance/queries/` directory. The directory is a "safe" version of the `name` given under `[test#]`. This directory contains the queries, the results, and a copy of the JSON config file used, if any, under the name `config.json`. + + A directory for each comparison between `[test1]` and `[test2]` is created in the `relevance/comparisons/` directory. The name is a concatenation of the "safe" versions of the `name`s given to the query sets. The original `.ini` file is copied to `config.ini`, the final report is in `report.html`, and the diffs are stored in the `diffs/` directory, named in order as `diff#.html`. + + + ## Other Tools + + There are a few other bits and bobs included with the Rel Lab. + + ### Cirrus Query Debugger + + The Cirrus Query Debugger (`cqd.py`) is a command line tool to display various debugging information for individual queries. + + Run `cqd.py --help` for more details. + + Note that `cqd.py` requires the `termcolor` package. 
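The sequential pairing of results and the zero-results metric described above can be sketched as follows. The toy result lines are made up for illustration; the `totalHits`, `query`, and `error` fields are the ones `relcomp.py` actually inspects:

```python
import json

def paired_results(lines_a, lines_b):
    """Pair run A's results with run B's positionally, as relcomp.py does;
    pairs where either side reported an error are skipped."""
    for line_a, line_b in zip(lines_a, lines_b):
        a, b = json.loads(line_a), json.loads(line_b)
        if "error" in a or "error" in b:
            continue
        yield a, b

def zero_results_rate(results):
    """Percentage of results with totalHits == 0 (the Zero Results metric)."""
    if not results:
        return 0.0
    zeros = sum(1 for r in results if r.get("totalHits") == 0)
    return 100.0 * zeros / len(results)

# Toy result lines standing in for two runs' `results` files.
run_a = ['{"query": "cat", "totalHits": 5}',
         '{"query": "dg", "totalHits": 0}',
         '{"error": "search timed out"}']
run_b = ['{"query": "cat", "totalHits": 7}',
         '{"query": "dg", "totalHits": 2}',
         '{"query": "bird", "totalHits": 1}']
pairs = list(paired_results(run_a, run_b))               # error pair skipped
baseline_zrr = zero_results_rate([a for a, b in pairs])  # 50.0
```

The per-pair bookkeeping in the real `relcomp.py` (diff numbering, example collection, HTML output) is omitted here.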
+ + Helpful hint: If you want to pipe the output of `cqd.py` through `less`, use `less`'s `-R` option, which makes it understand and preserve the color output from `cqd.py`. Depending on which part of the output you are using most, you might also want `less`'s `-S` option, which doesn't wrap lines (arrow left and right to see long lines). + + ### Import Indices + + Import Indices (`importindices.py`) downloads Elasticsearch indices from Wikimedia dumps and imports them to an Elasticsearch cluster. It lives with the Rel Lab but is used on the Elasticsearch server you connect to, not your local machine. + + ### Miscellaneous + + The `misc/` directory contains additional useful stuff: + + * `fulltextQueriesSample.hql` contains a well-commented example HQL query to run against Hive to extract a sample query set of fulltext queries. + + ### Gerrit Config + + These files help Gerrit process patches correctly and are not directly part of the Rel Lab: + + * `setup.cfg` + * `tox.ini` + + ## Options! + + There are lots of options that can be passed as JSON in `config` files, or as options to the Cirrus Query Debugger (specifically, or generally using the custom `-c` option). + + For more details on what the options do, see `CirrusSearch.php` in the [CirrusSearch extension](https://www.mediawiki.org/wiki/Extension:CirrusSearch). + + For reference, here are some options and their names in JSON, the Cirrus Query Debugger (CQD), or the web API (API names are available using `-c` with CQD). + + * *Phrase Window*—Default: 512; JSON: `wgCirrusSearchPhraseRescoreWindowSize`; CQD: `-pw`; API: `cirrusPhraseWindow`. + + * *Function Window*—Default: 8196; JSON: `wgCirrusSearchFunctionRescoreWindowSize`; CQD: `-fw`; API: `cirrusFunctionWindow`. + + * *Rescore Profile*—Default: default; CQD: `-rp`; + * default: boostlinks and templates by default + optional criteria activated by special syntax (namespaces, prefer-recent, language, ...) 
+ * default_noboostlinks: default minus boostlinks + * empty (will be deployed soon) + + * *All Fields*—Default: true/yes; JSON: `wgCirrusSearchAllFields`; CQD: `--allField`; API: `cirrusUseAllFields`. + * JSON default: {"use": true} + + * *Phrase Boost*—Default: 10; JSON: `wgCirrusSearchPhraseRescoreBoost`; API: `cirrusPhraseBoost`. + + * *Phrase Slop*—Default: 1; JSON: `wgCirrusSearchPhraseSlop`; API: `cirrusPhraseSlop`. + * API sets the `boost` sub-value + * JSON default: {"boost": 1, "precise": 0, "default": 0} + + * *Boost Links*—Default: true/yes; JSON: `wgCirrusSearchBoostLinks`; API: `cirrusBoostLinks`. + + * *Common Terms Query*—Default: false/no; JSON: `wgCirrusSearchUseCommonTermsQuery`; API: `cirrusUseCommonTermsQuery`. + + * *Common Terms Query Profile*—Default: default; API: `cirrusCommonTermsQueryProfile`. + * default: requires 4 terms in the query to be activated + * strict: requires 6 terms in the query to be activated + * aggressive_recall: requires 3 terms in the query to be activated + + See also the "[more like](https://www.mediawiki.org/wiki/Help:CirrusSearch#morelike:)" options. 
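For example, several of the JSON-settable options above could be combined into a single-line `config` file. The values below are purely illustrative, not recommendations:

```python
import json

# Illustrative values only; see CirrusSearch.php for what each option does.
config = {
    "wgCirrusSearchPhraseRescoreWindowSize": 512,
    "wgCirrusSearchPhraseRescoreBoost": 10,
    "wgCirrusSearchPhraseSlop": {"boost": 1, "precise": 0, "default": 0},
    "wgCirrusSearchAllFields": {"use": True},
    "wgCirrusSearchUseCommonTermsQuery": False,
}

# The runner expects the whole config on a single line:
config_line = json.dumps(config)
print(config_line)
```

The resulting line can be saved as, say, a hypothetical `test1.json` and referenced from `config` under `[test1]`.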
diff --git a/cqd.py b/cqd.py index 2c742f2..523d6b0 100755 --- a/cqd.py +++ b/cqd.py @@ -61,7 +61,6 @@ def __init__(self, args): self.limit = args.limit self.offset = args.offset - self.limit = args.limit self.functionWindow = args.functionWindow self.phraseWindow = args.phraseWindow self.rescoreProfile = args.rescoreProfile diff --git a/relcomp.py b/relcomp.py index 6c902c2..cee3148 100755 --- a/relcomp.py +++ b/relcomp.py @@ -104,20 +104,7 @@ """Add example diff to b2d_diff (delta=False) or d2b_diff (delta=True) """ - query_string = b_query = d_query = "" - - if "query" in b: - b_query = b["query"] - if "query" in d: - d_query = d["query"] - - if b_query == d_query: - query_string = b_query - else: - query_string = u"{} / {}".format(b_query, d_query) - - if query_string == "": - query_string = "[no-query-string]" + query_string = make_query_string(b, d) if delta: self.d2b_diff.append([index, query_string]) @@ -148,13 +135,16 @@ if self.raw_count: ret_string += "<b>{}:</b> {}{}".format(self.name, count, diffstr) else: + q_pct = 100*count/float(self.total_queries) if self.total_queries else 0 ret_string += "<b>{}:</b> {:.1f}%{}".format( - self.name, 100*count/float(self.total_queries), diffstr + self.name, q_pct, diffstr ) - return ret_string + "<br>\n" + ret_string += "<br>\n" + return ret_string.encode('ascii', 'xmlcharrefreplace') elif self.printnum > 0: # diff - ret_string = "<b>{}:</b><br>\n".format(self.name) + ret_string = "<b>{}:</b>\n".format(self.name) + ret_string += toggle_string() printed = 0 if self.printset == "random": # shuffle, unless all will be printed, then don't bother @@ -181,7 +171,8 @@ printed += 1 if printed >= self.printnum: break - return ret_string + "<br>\n" + ret_string += "</span>\n<br>\n" + return ret_string.encode('ascii', 'xmlcharrefreplace') return "" @@ -196,9 +187,10 @@ __metaclass__ = ABCMeta - def __init__(self): + def __init__(self, printnum=20): super(ZeroResultsRate, self).__init__("Zero Results", - symbols=["↓", "↑"]) + 
symbols=["↓", "↑"], + printnum=printnum) def has_condition(self, x, y): """Simple check: is totalHits == 0? @@ -215,12 +207,12 @@ __metaclass__ = ABCMeta - def __init__(self, topN=5, sorted=False): + def __init__(self, topN=5, sorted=False, printnum=20): sortstr = "Sorted" if sorted else "Unsorted" self.sorted = sorted self.topN = topN super(TopNDiff, self).__init__("Top {} {} Results Differ".format(topN, sortstr), - symmetric=True) + symmetric=True, printnum=printnum) def has_condition(self, x, y): if "totalHits" in x: @@ -263,21 +255,78 @@ return not len(x) == 0 -def print_report(target_dir, diff_count, file1, file2, myMetrics): +def make_query_string(x, y): + query_string = x_query = y_query = "" + + if "query" in x: + x_query = x["query"] + if "query" in y: + y_query = y["query"] + + if x_query == y_query: + query_string = x_query + else: + query_string = u"{} / {}".format(x_query, y_query) + + if query_string == "": + query_string = "[no-query-string]" + + return query_string + + +def print_report(target_dir, diff_count, file1, file2, myMetrics, errors): report_file = open(target_dir + "report.html", "w") report_file.write(textwrap.dedent("""\ + <script> + function toggle (button, span) {{ + sp = document.getElementById(span); + if (sp.style.display == 'none' || sp.style.display == '') {{ + button.innerHTML = '[ – ]'; + sp.style.display = 'inline'; + }} + else {{ + button.innerHTML = '[ + ]'; + sp.style.display = 'none'; + }} + }} + </script> + + <style> + .button {{cursor:pointer}} + .toggle {{display:none}} + </style> + <h2>Comparison run summary: {}</h2> <blockquote> <b>Stats:</b> {} query pairs compared<br> + """).format(target_dir, diff_count)) + + if len(errors): + report_file.write("<br>\n<font color=red><b>QUERY PAIRS WITH ERRORS: " + + "{}</b></font>\n".format(len(errors))) + report_file.write(toggle_string()) + printed = 0 + keylist = errors.keys() + shuffle(keylist) + for e in keylist: + report_file.write(" <font color=red>ERROR</font> " + + "<a 
href='diffs/diff{}.html'>{}</a><br>\n". + format(e, errors[e].encode('ascii', 'xmlcharrefreplace'))) + printed += 1 + if printed >= 50: + break + report_file.write("</span>\n") + + report_file.write(textwrap.dedent("""\ </blockquote> <h3>Baseline: {}</h3> <blockquote> <b>Metrics:</b><br> - """).format(target_dir, diff_count, file1)) + """).format(file1)) for m in myMetrics: - report_file.write(m.results("baseline").encode('ascii', 'xmlcharrefreplace')) + report_file.write(m.results("baseline")) report_file.write(textwrap.dedent("""\ </blockquote> @@ -288,7 +337,7 @@ """).format(file2)) for m in myMetrics: - report_file.write(m.results("delta").encode('ascii', 'xmlcharrefreplace')) + report_file.write(m.results("delta")) report_file.write(textwrap.dedent("""\ </blockquote> @@ -298,9 +347,16 @@ """)) for m in myMetrics: - report_file.write(m.results().encode('ascii', 'xmlcharrefreplace')) + report_file.write(m.results()) report_file.write("</blockquote>") + + +def toggle_string(): + toggle_string.num += 1 + return("<span onclick='toggle(this,\"toggle{}\")' class=button>".format(toggle_string.num) + + "[ + ]</span><br>\n<span id=toggle{} class=toggle>\n".format(toggle_string.num)) +toggle_string.num = 0 def main(): @@ -311,22 +367,29 @@ parser.add_argument("file", nargs=2, help="files to diff") parser.add_argument("-d", "--dir", dest="dir", default="./comp/", help="output directory, default is ./comp/") + parser.add_argument("-p", "--printnum", dest="printnum", default=20, + help="number of samples per metric, default is 20") args = parser.parse_args() (file1, file2) = args.file target_dir = args.dir + "/" + printnum = int(args.printnum) if not os.path.exists(target_dir): os.makedirs(os.path.dirname(target_dir)) diff_count = 0 + errors = {} # set up metrics + # TODO: make this configurable from the .ini file myMetrics = [ QueryCount(), - ZeroResultsRate(), - TopNDiff(5, sorted=False), - TopNDiff(5, sorted=True) + ZeroResultsRate(printnum=printnum), + TopNDiff(3, 
sorted=False, printnum=printnum), + TopNDiff(3, sorted=True, printnum=printnum), + TopNDiff(5, sorted=False, printnum=printnum), + TopNDiff(5, sorted=True, printnum=printnum) ] with open(file1) as a, open(file2) as b: @@ -342,10 +405,15 @@ bjson = json.loads(bline) diff_count += 1 + + if 'error' in ajson or 'error' in bjson: + errors[diff_count] = make_query_string(ajson, bjson) + continue + for m in myMetrics: m.measure(ajson, bjson, diff_count) - print_report(target_dir, diff_count, file1, file2, myMetrics) + print_report(target_dir, diff_count, file1, file2, myMetrics, errors) if __name__ == "__main__": diff --git a/relevance.ini b/relevance.ini index d15d57f..3a6b4da 100644 --- a/relevance.ini +++ b/relevance.ini @@ -11,15 +11,19 @@ ; JSON Diff tool jsonDiffTool = python jsondiff.py -d ; Comparison/metric reporting tool -metricTool = python relcomp.py -d +; additional params should go before -d +; -p 100 to set the number of examples printed per metric to 100 (defaults to 20) +metricTool = python relcomp.py -p 20 -d +; queries to be run +queries = test.q [test1] name = Test 1 -queries = test1.q -;config = test1.json +config = test1.json [test2] name = Test 2 -queries = test2.q ;config = test2.json +; labHost, searchCommand, queries, and config can be specified globally under [settings] or locally under [test#]. Local settings override global settings. 
+; config is optional \ No newline at end of file diff --git a/relevancyRunner.py b/relevancyRunner.py index f416279..9a96433 100755 --- a/relevancyRunner.py +++ b/relevancyRunner.py @@ -39,16 +39,24 @@ qname = getSafeName(config.get(section, 'name')) qdir = config.get('settings', 'workDir') + "/queries/" + qname refreshDir(qdir) - cmdline = config.get('settings', 'searchCommand') + cmdline = config.get(section, 'searchCommand') if config.has_option(section, 'config'): cmdline += " --options " + pipes.quote(open(config.get(section, 'config')).read()) shutil.copyfile(config.get(section, 'config'), qdir + '/config.json') # archive search config runCommand("cat %s | ssh %s %s > %s" % (config.get(section, 'queries'), - config.get('settings', 'labHost'), + config.get(section, 'labHost'), pipes.quote(cmdline), qdir + "/results")) shutil.copyfile(config.get(section, 'queries'), qdir + '/queries') # archive queries return qdir + "/results" + + +def distributeGlobalSettings(config, globals, sections, settings): + # if settings are missing from sections, copy from globals + for sec in sections: + for set in settings: + if not config.has_option(sec, set) and config.has_option(globals, set): + config.set(sec, set, config.get(globals, set)) def checkSettings(config, section, settings): @@ -69,10 +77,11 @@ config = ConfigParser.ConfigParser() config.readfp(open(args.config)) -checkSettings(config, 'settings', ['labHost', 'workDir', 'jsonDiffTool', - 'metricTool', 'searchCommand']) -checkSettings(config, 'test1', ['name', 'queries']) -checkSettings(config, 'test2', ['name', 'queries']) +distributeGlobalSettings(config, 'settings', ['test1', 'test2'], + ['queries', 'labHost', 'searchCommand', 'config']) +checkSettings(config, 'settings', ['workDir', 'jsonDiffTool', 'metricTool']) +checkSettings(config, 'test1', ['name', 'queries', 'labHost', 'searchCommand']) +checkSettings(config, 'test2', ['name', 'queries', 'labHost', 'searchCommand']) res1 = runSearch(config, 'test1') res2 = 
runSearch(config, 'test2') -- To view, visit https://gerrit.wikimedia.org/r/271356 To unsubscribe, visit https://gerrit.wikimedia.org/r/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ib5ef1717883ddfce1ec8b3cfd6fd2fdf19a86a7f Gerrit-PatchSet: 5 Gerrit-Project: wikimedia/discovery/relevancylab Gerrit-Branch: master Gerrit-Owner: Tjones <[email protected]> Gerrit-Reviewer: DCausse <[email protected]> Gerrit-Reviewer: EBernhardson <[email protected]> Gerrit-Reviewer: Smalyshev <[email protected]> Gerrit-Reviewer: jenkins-bot <> _______________________________________________ MediaWiki-commits mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
