Hi Trey, 

Cool analysis. I'm curious whether the infrastructure lets you look at query 
sessions: do these queries with special symbols occur late in a multi-query 
sequence that included simpler versions earlier in the sequence?

Maybe you can segment users who are confused about the query language versus 
power users who are iteratively enhancing a query. The latter seems likely to 
generate low-result-count queries that are more acceptable because the user 
tweaked the query intentionally. 
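
A rough sketch of the segmentation I have in mind, assuming a session is 
available as a time-ordered list of query strings. The function names and the 
"refining" / "cold" / "plain" labels are made up for illustration, and 
"special syntax" here means only double quotes or a trailing question mark:

```python
def has_special_syntax(query: str) -> bool:
    """True if the query contains a double quote or ends in a question mark."""
    return '"' in query or query.rstrip().endswith("?")


def simplify(query: str) -> str:
    """Normalize a query: quotes become spaces, trailing '?' and spaces are
    dropped, case is folded, and runs of whitespace are collapsed."""
    stripped = query.replace('"', " ").rstrip("? ")
    return " ".join(stripped.lower().split())


def classify_session(queries: list[str]) -> str:
    """Label a session by how special syntax first shows up in it.

    'refining': the first special query has a simpler version earlier
                in the same session (likely an intentional tweak).
    'cold':     the first special query has no simpler predecessor
                (possibly a user unaware of the query language).
    'plain':    no query in the session uses special syntax.
    """
    seen = set()
    for query in queries:
        if has_special_syntax(query):
            return "refining" if simplify(query) in seen else "cold"
        seen.add(simplify(query))
    return "plain"
```

For example, a session of `barack obama` followed by `"barack obama"` comes 
out as "refining", while a lone `who is barack obama?` comes out as "cold". 
Exact matching on the simplified string is a crude stand-in for "simpler 
version"; something fuzzier may be needed in practice.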


John



> On May 27, 2016, at 5:17 PM, Trey Jones <[email protected]> wrote:
> 
> Hi everyone,
> 
> Mikhail, Data Analyst Extraordinaire, recently published his report, "From 
> Zero to Hero"[1] on the relationship between various features of queries as 
> strings (rather than the content of the query) and those queries getting no 
> results.
> 
> Today for my 10% project I took a quick look at the two most impactful 
> features, quotes and question marks. These two features stood out in 
> Mikhail's report as having both relatively high volume and a relatively 
> higher chance of getting no results.
> 
> I'm not planning on doing a more formal report right now, though I will 
> probably copy this email to my Notes page.
> 
> Quotes make sense, as we try to get an exact match for strings inside quotes, 
> which limits our options for making a match. Question marks are actually a 
> little-known, little-used, poorly documented, and poorly understood wildcard: 
> they stand for any single character. Most users use them to ask questions.
> 
> I took a random sample of 50,000 English Wikipedia queries (using my 
> now-favorite criteria at [2]—basically, full text queries from normal humans 
> (as best as we can tell) with fewer than 3 results). I extracted all the 
> queries with quotes (170) and all the queries that ended in question marks, 
> that is, looked like questions (274). There were 4 queries that were all 
> question marks and spaces (e.g., ???? ???????? ????); they caused problems 
> because they are very expensive queries that repeatedly failed on the test 
> cluster, so I discarded them. I also took a random sub-sample of 1K queries 
> from the larger 
> sample of 50K.
> 
> All samples had plenty of gibberish queries (e.g., "fhdsfhsdjkfgdsjklgsdl"?), 
> queries in other languages, and the other usual cruft.
> 
> For the sample with quotes, I used Relevance Forge to compare the results of 
> running queries as is vs replacing quotes with spaces. The summary stats are 
> below. The zero results rate for queries with quotes went down by almost 
> half, and more than half of queries had changes in their top 5 results. The 
> TotalHits stats are wildly skewed by one query that increased its results by 
> over 300,000. (There always seems to be an outlier!)
> 
> Metrics:
>    Query Count: 170
>       Num TotalHits Changed: μ: 3049.99; σ: 26435.14; median: 1.00
> 
>    Zero Results: 38.2% (-37.1%)
>    Top 5 Sorted Results Differ: 51.8%
>    Top 5 Unsorted Results Differ: 51.2%
>       Num Top 5 Results Changed: μ: 2.14; σ: 2.30; median: 1.00
> 
> For the sample with question marks, I used Relevance Forge to compare the 
> results of running queries as is vs dropping all trailing question marks and 
> spaces. Some queries ended in multiple question marks (removed), and some 
> queries had other question marks in the middle of the query (kept).  The 
> summary stats are below. The picture is similar to the one for quotes: almost 
> half of the zero-results queries got results, more than half of all queries 
> had changes to their top 5 results, and the mean number of total hits is 
> blown out by one query that got more than 300K additional results.
> 
> Metrics:
>    Query Count: 274
>       Num TotalHits Changed: μ: 1875.48; σ: 19885.60; median: 1.00
> 
>    Zero Results: 43.1% (-39.1%)
>    Top 5 Sorted Results Differ: 53.3%
>    Top 5 Unsorted Results Differ: 53.3%
>       Num Top 5 Results Changed: μ: 2.22; σ: 2.33; median: 1.00
> 
> For the 1K query sample, I used Relevance Forge to compare the results of 
> running queries as is vs (a) replacing quotes with spaces, (b) dropping all 
> trailing question marks and spaces, and (c) doing both (there are even a very 
> few queries with both quotes and trailing question marks!).
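
For concreteness, here is a minimal sketch of how I read the three rewrites 
(a), (b), and (c). This is not the code Trey ran; the function names are 
hypothetical, and I am assuming "quotes" means the ASCII double quote only:

```python
def drop_quotes(query: str) -> str:
    """(a) Replace every double quote with a space."""
    return query.replace('"', " ")


def drop_trailing_question_marks(query: str) -> str:
    """(b) Strip trailing question marks and spaces.

    Question marks in the middle of the query are left alone.
    """
    return query.rstrip("? ")


def drop_both(query: str) -> str:
    """(c) Apply (a), then (b)."""
    return drop_trailing_question_marks(drop_quotes(query))
```

Note that a query made up only of question marks and spaces becomes the empty 
string under (b), so a caller would have to skip those.
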
> 
> Keep in mind that these are all poorly performing queries (fewer than 3 
> results). Summary results:
> 
> (a) quotes
> Metrics:
>    Query Count: 1000
>       Num TotalHits Changed: μ: 0.31; σ: 9.70; median: 0.00
>    Zero Results: 79.5% (-0.1%)
>    Top 5 Sorted Results Differ: 0.1%
>    Top 5 Unsorted Results Differ: 0.1%
>       Num Top 5 Results Changed: μ: 0.01; σ: 0.16; median: 0.00
> 
> (b) question marks
> Metrics:
>    Query Count: 1000
>       Num TotalHits Changed: μ: 0.16; σ: 3.45; median: 0.00
>    Zero Results: 79.4% (-0.2%)
>    Top 5 Sorted Results Differ: 0.4%
>    Top 5 Unsorted Results Differ: 0.4%
>       Num Top 5 Results Changed: μ: 0.02; σ: 0.32; median: 0.00
> 
> (c) quotes and question marks (pretty much the sum of the previous two!)
> Metrics:
>    Query Count: 1000
>       Num TotalHits Changed: μ: 0.47; σ: 10.30; median: 0.00
>    Zero Results: 79.3% (-0.3%)
>    Top 5 Sorted Results Differ: 0.5%
>    Top 5 Unsorted Results Differ: 0.5%
>       Num Top 5 Results Changed: μ: 0.03; σ: 0.35; median: 0.00
> 
> Overall, it's a pretty small effect, and the results are not always great 
> when quotes are dropped, but it's a very small effort to make the change.
> 
> A quick look at the queries with question marks didn't show any that were 
> obviously intended to be used as wildcards (except maybe all-question-marks, 
> like ????—but who knows what that is supposed to be?).
> 
> It has been suggested before and I would also now recommend disabling ? as a 
> wildcard—it causes many more problems than it solves.
> 
> Re-running poor-performing queries that have quotes without the quotes is an 
> easy win. We should do that too!
> 
> 
> Thoughts, comments, and suggestions welcome!
> 
> —Trey
> 
> [1] 
> https://github.com/wikimedia-research/Discovery-Search-Adhoc-QueryFeatures/blob/master/report.pdf
> [2] 
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki#Random_sampling
> 
> 
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery