Tjones has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/293875

Change subject: Improve metrics for RelForge metric tool
......................................................................

Improve metrics for RelForge metric tool

Generalize ZeroResultsRate metric to HitsWithinRange metric,
and instantiate Zero Results Rate (== 0) and Poorly Performing
Percentage (< 3) metrics as HitsWithinRange.

Add "within ±1000" chart for TotalHits to keep massive outliers
(> 100K changes) from overshadowing everything.

Use better bucket sizes for histograms in Charts: N for TopN,
20 by default, 100 for the ±1000 chart.

Add ranges [min, max] to stats to give a better sense of outliers.

Change-Id: I812b6aa8684c4464131a1eddefe2c7a6dcfc398b
---
M README.md
M relcomp.py
2 files changed, 50 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/relevanceForge 
refs/changes/75/293875/1

diff --git a/README.md b/README.md
index 2531d6f..96082fe 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
-       ___      __                             ____                 
-      / _ \___ / /__ _  _____ ____  _______   / __/__  _______ ____ 
+       ___      __                             ____
+      / _ \___ / /__ _  _____ ____  _______   / __/__  _______ ____
      / , _/ -_) / -_) |/ / _ `/ _ \/ __/ -_) / _// _ \/ __/ _ `/ -_) *
-    /_/|_|\__/_/\__/|___/\_,_/_//_/\__/\__/ /_/  \___/_/  \_, /\__/ 
+    /_/|_|\__/_/\__/|___/\_,_/_//_/\__/\__/ /_/  \___/_/  \_, /\__/
                                                          /___/
 
 The primary purpose of the Relevance Forge is to allow us<sup>†</sup> to 
experiment with proposed modifications to our search process and gauge their 
effectiveness<sup>‡</sup> and impact<sup>§</sup> before releasing them into 
production, and even before doing any kind of user acceptance or A/B testing. 
Also, testing in the Relevance Forge gives an additional benefit over A/B tests 
(esp. in the case of very targeted changes): with A/B tests we aren't 
necessarily able to test the behavior of the *same query* with two different 
configurations.
@@ -38,7 +38,7 @@
 
 The `jsonDiffTool` is implemented as `jsondiff.py`, "a somewhat smarter search 
result JSON diff tool". This version does an automatic alignment at the level 
of results pages (matching pageIds), munges the JSON results, and does a 
structural diff of the results. Structural elements that differ are marked as 
differing (yellow highlight), but no details are given on the diffs (i.e., only 
binary diffing of leaf nodes of the JSON structure). Changes in position from 
the baseline to delta are marked (e.g., ↑1 (light green) or ↓2 (light red)). 
New items are bright green and marked with "\*". Lost items are bright red and 
marked with "·". Clicking on an item number will display the item in the 
baseline and delta side-by-side. Diffing results with explanations (i.e., using 
`--explain` in the `searchCommand`) is currently *much* slower, so don't enable 
that unless you are going to use it.
 
-The `metricTool` is implemented as `relcomp.py`, which generates an HTML 
report comparing two Relevance Forge query runs. A number of metrics are 
defined, including zero results rate and a generic top-N diffs (sorted or not). 
Adding and configuring these metrics can be done in `main`, in the array 
`myMetrics`. Examples of queries that change from one run to the next for each 
metric are provided, with links into the diffs created by `jsondiff.py`.
+The `metricTool` is implemented as `relcomp.py`, which generates an HTML 
report comparing two Relevance Forge query runs. A number of metrics are 
defined, including generic metrics based on number of results provided and 
top-N diffs (sorted or not). Adding and configuring these metrics can be done 
in `main`, in the array `myMetrics`. Examples of queries that change from one 
run to the next for each metric are provided, with links into the diffs created 
by `jsondiff.py`.
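
As a rough sketch of such a configuration, mirroring the `myMetrics` array 
this patch sets up in `main` (here `printnum`, presumably the number of 
example queries shown per metric, is left at an assumed 20):

    # Sketch of a myMetrics configuration in main() of relcomp.py
    myMetrics = [
        QueryCount(),
        HitsWithinRange("Zero Results Rate", 0, 0, printnum=20),
        HitsWithinRange("Poorly Performing Percentage", 2, 0, printnum=20),
        TopNDiff(3, sorted=True, printnum=20),
        TopNDiff(3, sorted=False, printnum=20),
    ]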
 
 Running the queries is typically the most time-consuming part of the process. 
If you ask for a very large number of results for each query (≫100), the diff 
step can be very slow. The report processing is generally very quick.
 
@@ -90,7 +90,7 @@
 
 **`QueryCount`** gives a count of queries in each corpus. It was also a 
convenient place to add statistics and charts (see below) for the number of 
TotalHits (which can be toggled with the `resultscount` parameter). 
`QueryCount` does not show any Diffs (see below).
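
This patch only shows the `resultscount` attribute being read, so the 
following is an assumed illustration of how the toggle might be passed in:

    # Hypothetical usage: resultscount as a constructor flag is an
    # assumption; only self.resultscount appears in this patch.
    QueryCount(resultscount=True)   # include TotalHits stats and charts
    QueryCount(resultscount=False)  # query counts only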
 
-**`ZeroResultsRate`** calculates the zero results rate for each corpus/config 
combo, and computes the difference between these rates between baseline and 
delta. `ZeroResultsRate` does show Diffs (see below).
+**`HitsWithinRange`** calculates the percentage of queries with a number of 
results between two given values (inclusive) for each corpus/config combo, 
and computes the difference in these rates between baseline and delta. 
`HitsWithinRange` does show Diffs (see below). This metric is a 
generalization of "zero results rate" (0-0 results) and "poorly performing 
percentage" (0-2 results), both of which are defined by default in terms of 
`HitsWithinRange`.
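
The check itself is small; a condensed sketch of `has_condition` from this 
patch, with illustrative names (`hits_within_range`, `lo`, `hi`) that are not 
in the source:

    # Condensed sketch of HitsWithinRange.has_condition
    def hits_within_range(x, lo, hi):
        x_hits = x.get("totalHits", 0)  # empty JSON means no hits
        return lo <= x_hits <= hi

    hits_within_range({}, 0, 0)                 # True: no hits is zero results
    hits_within_range({"totalHits": 2}, 0, 2)   # True: poorly performing
    hits_within_range({"totalHits": 50}, 0, 2)  # False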
 
 **`TopNDiff`** looks at and reports the number of queries with differences in 
the top *n* results returned. *n* can be set to any integer (but shouldn't be 
larger than the number of results requested by the `searchCommand`, or the 
results won't be very meaningful). Differences can be considered `sorted` or 
not; e.g., if `sorted=True`, then swapping the top two results counts as a 
difference; if `sorted=False`, then it does not. `TopNDiff` does show Diffs 
(see below).
 
@@ -98,10 +98,12 @@
 
 It makes sense to have multiple `TopNDiff` metrics—e.g., sorted and unsorted 
top 3, 5, 10, and 20—since these different stats tell different stories.
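
One way to register such a family, sketched against the `myMetrics` array in 
`main` (the patch itself lists each depth explicitly; the loop is only an 
illustration):

    # sorted=True counts reorderings within the top n as diffs;
    # sorted=False only counts membership changes.
    for n in (3, 5, 10, 20):
        for is_sorted in (True, False):
            myMetrics.append(TopNDiff(n, sorted=is_sorted))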
 
-**Statistics and Charts:** When statistics and charts are to be displayed, the 
mean (μ), standard deviation (σ), and median are computed, both for the 
number/count of differences and the percent differences. These can be very 
different or nearly identical. For example, if every query got one more result 
in TotalHits, then that's +1 for every query, but for a query that originally 
had 1 result, it's +100%, but for a query that had 100 results, it's only +1%. 
For results that change from 0, (i.e., from 0 results to 5 results), the 
denominator used is 1 (so 0 to 5 is +500%).
+**Statistics and Charts:** When statistics and charts are to be displayed, 
the mean (μ), standard deviation (σ), median, and range are computed, both 
for the number/count of differences and the percent differences. These can 
be very different or nearly identical. For example, if every query got one 
more result in TotalHits, then that's +1 for every query; but for a query 
that originally had 1 result, it's +100%, while for a query that had 100 
results, it's only +1%. For results that change from 0 (e.g., from 0 results 
to 5 results), the denominator used is 1 (so 0 to 5 is +500%).
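
That percentage rule matches the list comprehension in this patch, where 
each data point pairs a baseline count with its change; a worked sketch:

    # (baseline hits, change in hits) -> percent change; a 0-hit
    # baseline uses denominator 1, so 0 -> 5 hits counts as +500%.
    data = [(1.0, 1.0), (100.0, 1.0), (0.0, 5.0)]
    pct_changed = [x[1]/x[0] if x[0] != 0 else x[1] for x in data]
    # pct_changed == [1.0, 0.01, 5.0], i.e. +100%, +1%, +500%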
 
 Three charts are currently provided: number/count differences ("All queries, 
by number of changed ——"), number/count differences after dropping all 0 
changes ("Changed queries, by number of changed ——"), and percent differences 
after dropping all 0 changes ("Changed queries, by percent of changed ——"). 
Since a change affecting 40% of queries is a pretty big change, the "0 changes" 
part of the graph often wildly dominates the rest. Dropping them effectively 
allows zooming in on the rest.
 
+A fourth chart is available, currently only shown for TotalHits under the 
Query Count metric: "Changed queries, changed by < 1000, by number of changed 
——". This chart features number/count differences after dropping all 0 changes 
and all changes with a magnitude greater than 1000. As above, dropping "0 
changes" focuses on the smaller number of changes that are ≠0. Limiting to 
changes in the ±1000 range deals with another issue: sometimes the number of 
TotalHits can change by hundreds of thousands, though usually only for a very 
small number of queries; when that happens, the charts can get broken down into 
buckets with a span of 10,000 or even 100,000, which is not a useful level of 
detail. This chart breaks down a range no bigger than [-1000, 1000] into 100 
buckets, which allows you to look in more detail at the range where most 
changes take place, regardless of any outliers.
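
The selection feeding this chart is a one-liner; a sketch condensed from 
`make_charts` in this patch:

    # Keep only nonzero changes within +/-1000, then use 100 bins so
    # a handful of huge outliers can't force 10,000-wide buckets.
    within = [x for x in num_changed if abs(x) < 1000 and x != 0]
    make_hist(within, image_path + file_within100, bins=100)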
+
 Charts are currently automatically generated by matplotlib, and sometimes have 
trouble with scale and outliers. Still, it's nice to get some idea of the 
distribution since the distributions of changes we see are often not normal, 
and thus μ, σ, and median are useful benchmarks, but don't tell the whole story.
 
 The charts appear in the report scaled fairly small, in a standard order, 
and each is a link to the full-sized image.
diff --git a/relcomp.py b/relcomp.py
index b0ae581..5924cdc 100755
--- a/relcomp.py
+++ b/relcomp.py
@@ -193,22 +193,27 @@
         pass
 
 
-class ZeroResultsRate(Metric):
-    """Percentage of queries that return zero results."""
+class HitsWithinRange(Metric):
+    """Percentage of queries that return a number of results within a given 
range (inclusive)."""
 
     __metaclass__ = ABCMeta
 
-    def __init__(self, printnum=20):
-        super(ZeroResultsRate, self).__init__("Zero Results",
+    def __init__(self, name, max, min=0, printnum=20):
+        super(HitsWithinRange, self).__init__(name,
                                               symbols=["&darr;", "&uarr;"],
                                               printnum=printnum)
+        self.max = max
+        self.min = min
 
     def has_condition(self, x, y, is_baseline=False):
-        """Simple check: is totalHits == 0?
+        """Simple check: is min <= totalHits <= max?
         """
         if "totalHits" in x:
-            return x["totalHits"] == 0
-        return 1  # empty JSON mean no hits
+            x_hits = x["totalHits"]
+        else:
+            x_hits = 0  # empty JSON means no hits
+
+        return self.min <= x_hits <= self.max
 
 
 class TopNDiff(Metric):
@@ -231,8 +236,9 @@
         global image_path
         ret_string = super(TopNDiff, self).results(what)
         if what == "delta" and not self.sorted and self.showstats:
-            ret_string += num_num0_pct_chart(self.magnitude, 
"top{}".format(self.topN),
-                                             "Top {} 
Results".format(self.topN))
+            ret_string += make_charts(self.magnitude, 
"top{}".format(self.topN),
+                                      "Top {} Results".format(self.topN),
+                                      bins=self.topN)
         return ret_string
 
     def has_condition(self, x, y, is_baseline=False):
@@ -300,7 +306,8 @@
         global image_path
         ret_string = super(QueryCount, self).results(what)
         if what == "delta" and self.resultscount:
-            ret_string += num_num0_pct_chart(self.magnitude, "querycount", 
"TotalHits")
+            ret_string += make_charts(self.magnitude, "querycount", 
"TotalHits",
+                                      lessThan1000=True)
         return ret_string
 
 
@@ -409,7 +416,7 @@
 toggle_string.num = 0
 
 
-def make_hist(data, file, title="", xlab="", ylab="", bins=0, yformat="", 
xformat=""):
+def make_hist(data, file, title="", xlab="", ylab="", bins=20, yformat="", 
xformat=""):
     plt.clf()
     if bins:
         plt.hist(data, bins)
@@ -436,7 +443,7 @@
     fig.savefig(file)
 
 
-def num_num0_pct_chart(data, file_prefix, label):
+def make_charts(data, file_prefix, label, lessThan1000=False, bins=20):
     ret_string = ""
     num_changed = [x[1] for x in data]
     pct_changed = [x[1]/x[0] if x[0] != 0 else x[1] for x in data]
@@ -444,26 +451,37 @@
     file_num0 = "{}_num0.png".format(file_prefix)
     file_num = "{}_num.png".format(file_prefix)
     file_pct = "{}_pct.png".format(file_prefix)
-    make_hist(num_changed, image_path + file_num0,
+    make_hist(num_changed, image_path + file_num0, bins=bins,
               xlab="Number {} Changed".format(label), ylab="Frequency",
               title="All queries, by number of changed {}".format(label))
-    make_hist([x for x in num_changed if x != 0], image_path + file_num,
+    make_hist([x for x in num_changed if x != 0], image_path + file_num, 
bins=bins,
               xlab="Number {} Changed".format(label), ylab="Frequency",
               title="Changed queries, by number of changed {}".format(label))
-    make_hist([x for x in pct_changed if x != 0], image_path + file_pct,
+    make_hist([x for x in pct_changed if x != 0], image_path + file_pct, 
bins=bins,
               xlab="Percent {} Changed".format(label), ylab="Frequency", 
xformat="pct",
               title="Changed queries, by percent of changed {}".format(label))
     ret_string += indent + "Num {} Changed: &mu;: ".format(label) +\
-        "{:0.2f}; &sigma;: {:0.2f}; median: {:0.2f}<br>\n".format(
-        numpy.mean(num_changed), numpy.std(num_changed), 
numpy.median(num_changed))
+        "{:0.2f}; &sigma;: {:0.2f}; median: {:0.2f}; range: [{:0.0f}, 
{:0.0f}]<br>\n".format(
+        numpy.mean(num_changed), numpy.std(num_changed), 
numpy.median(num_changed),
+        numpy.amin(num_changed), numpy.amax(num_changed))
     ret_string += indent + "Pct {} Changed: &mu;: ".format(label) +\
-        "{:0.1f}%; &sigma;: {:0.1f}%; median: {:0.1f}%<br>\n".format(
-        numpy.mean(pct_changed)*100, numpy.std(pct_changed)*100, 
numpy.median(pct_changed)*100)
+        "{:0.1f}%; &sigma;: {:0.1f}%; median: {:0.1f}%; range: [{:0.2f}%, 
{:0.2f}%]<br>\n".format(
+        numpy.mean(pct_changed)*100, numpy.std(pct_changed)*100, 
numpy.median(pct_changed)*100,
+        numpy.amin(pct_changed)*100, numpy.amax(pct_changed)*100)
     ret_string += indent + "Charts " + toggle_string() + "<br>\n" +\
         indent + "<a href='{0}'><img src='{0}' 
height=125></a>".format(image_dir + file_num0) +\
-        indent + "<a href='{0}'><img src='{0}' 
height=125></a>".format(image_dir + file_num) +\
-        indent + "<a href='{0}'><img src='{0}' 
height=125></a>".format(image_dir + file_pct) +\
-        "</span><br>\n"
+        indent + "<a href='{0}'><img src='{0}' 
height=125></a>".format(image_dir + file_num)
+    if lessThan1000:
+        file_within100 = "{}_within100.png".format(file_prefix)
+        make_hist([x for x in num_changed if abs(x) < 1000 and x != 0],
+                  image_path + file_within100, bins=100,
+                  xlab="Number {} Changed".format(label), ylab="Frequency",
+                  title="Changed queries, changed by < 1000, by number of 
changed {}".format(label))
+        ret_string += indent + "<a href='{0}'><img src='{0}' height=125></a>".\
+            format(image_dir + file_within100)
+
+    ret_string += indent + "<a href='{0}'><img src='{0}'".format(image_dir + 
file_pct) +\
+        " height=125></a></span><br>\n"
     return ret_string
 
 
@@ -498,7 +516,8 @@
     # TODO: make this configurable from the .ini file
     myMetrics = [
         QueryCount(),
-        ZeroResultsRate(printnum=printnum),
+        HitsWithinRange("Zero Results Rate", 0, 0, printnum=printnum),
+        HitsWithinRange("Poorly Performing Percentage", 2, 0, 
printnum=printnum),
         TopNDiff(3, sorted=True, printnum=printnum),
         TopNDiff(3, sorted=False, printnum=printnum),
         TopNDiff(5, sorted=True, printnum=printnum),

-- 
To view, visit https://gerrit.wikimedia.org/r/293875
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I812b6aa8684c4464131a1eddefe2c7a6dcfc398b
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/relevanceForge
Gerrit-Branch: master
Gerrit-Owner: Tjones <tjo...@wikimedia.org>
