[oae-dev] Care and Feeding of Search Queries

Ray Davis Wed, 25 Apr 2012 09:05:19 -0700

Perhaps I've been driven mad, but I wonder if we should start 
abstracting query details into centralized service locations. (For 
example, remove code in which client-side JavaScript and server-side 
servlets try to find related content, and replace it by a centralized 
"relatedContent" end-point.)


The OAE (like Sling/Jackrabbit) uses Solr/Lucene for the type of queries 
we would typically do through SQL. That is certainly viable, but it may 
have encouraged some bad assumptions. Using Hibernate and an IDE to 
generate data tables directly from Java class definitions doesn't 
magically guarantee scalable, stable production code. Similarly, SQL-type 
queries in Solr/Lucene need the sort of focused attention that DBAs give 
SQL queries.

As is, our codebase is full of queries which don't match UX 
expectations, or queries that are unnecessarily slow, or both. Most of 
the slow queries I see could be prevented by following two basic principles:

* Avoid wildcards. Solr/Lucene is brilliant at Googlesque fuzzy 
searches, and wildcards are rarely needed for free-form text. For 
autocompletion of structured data, there are better solutions.

* Don't use redundant clauses. New Solr/Lucene search clauses aren't 
free. If a one-clause query produces the same results as a thirty-clause 
query, the one-clause query will be much faster. If you can find a 
matching field that doesn't appear much in the index, that will be faster 
than a multi-valued field associated with every document (e.g., "path").

We've known these principles since October 2010,[1] but our logs remain 
full of violations. Why? Nothing prevents the mistakes from being made. 
And the mistakes can be hard to catch in review because they combine 
code written by two different project teams in three different source 
repositories.
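
One way to make the mistakes harder to commit would be an automated check 
over query strings before they ship. The following is only a rough sketch 
in Python, not anything that exists in our codebase: the name lint_query, 
the MAX_CLAUSES threshold, and the regex approximation of Lucene syntax 
are all assumptions made for illustration.

```python
import re

# Hypothetical threshold; pick whatever limit the team agrees on.
MAX_CLAUSES = 5

# Rough approximation of "field:" in Lucene syntax. This is not a parser.
FIELD_RE = re.compile(r"\b([A-Za-z_][\w.]*):")
# "*" and "?" are the Lucene wildcard characters.
WILDCARD_RE = re.compile(r"[*?]")


def lint_query(q):
    """Return a list of warnings for a Solr query string."""
    warnings = []
    fields = FIELD_RE.findall(q)
    if WILDCARD_RE.search(q):
        warnings.append("wildcard")
    if len(fields) > MAX_CLAUSES:
        warnings.append("too many clauses (%d)" % len(fields))
    duplicates = sorted({f for f in fields if fields.count(f) > 1})
    if duplicates:
        warnings.append("repeated fields: " + ", ".join(duplicates))
    return warnings
```

Run against query templates in a unit test, a check like this would flag 
both principles above without anyone needing to read three repositories.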

Here are four examples I've looked at in the last week:

1) Recently changed files and related content

https://jira.sakaiproject.org/browse/SAKIII-5322
https://jira.sakaiproject.org/browse/KERN-2685

This odd-looking item often shows up in our production logs:

*ERROR* GET /var/search/pool/all.infinity.json ... Very slow solr query 
1626 ms q=resourceType:sakai/pooled-content AND (content:(* 1234567 *) 
OR filename:(* 1234567 *) OR tag:(* 1234567 *) OR description:(* 1234567 
* OR mimeType: * 1234567 *) OR ngram:(* 1234567 *) OR edgengram:(* 
1234567 *) OR tag:(* 1234567 *)) ...

I count at least six mistakes. Syntactically, this wreck involves both 
client-side and server-side code. Semantically, it looks like a 
desperate stab by someone who didn't want to bother server-side 
developers for a more useful "related content" search. Meanwhile, more 
useful "find related content" logic actually does exist server-side, but 
never got matched to this particular client-side development task.

2) My Library

https://jira.sakaiproject.org/browse/KERN-2805

Another typical "very slow" query:

resourceType:sakai/pooled-content AND ((manager:(211159) OR 
viewer:(211159)) OR (showalways:true AND (...)) AND (content:(*) OR 
filename:(*) OR tag:(*) OR description:(*) OR path:(*) OR ngram:(*) OR 
edgengram:(*))

Redundant clauses of unnecessary wildcards! The only way it could be 
worse is if the wildcards were doubled or tripled up.

This will look familiar to my fellow old-timers.[1] But it's still 
easier for developers to make the mistake than not to make the mistake.
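
If the wildcard group really is redundant, as I'm claiming, then an empty 
search box should add no text clause at all. Here is a minimal sketch of 
that idea in Python. The helper names (library_query, escape_term) and the 
default text field are hypothetical, and the "showalways" branch of the 
logged query is left out because the log elides its contents.

```python
import re

# Characters and operators with special meaning in Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term):
    """Backslash-escape Lucene special characters in user input."""
    return _SPECIAL.sub(r"\\\1", term)


def library_query(user_id, term="", text_field="content"):
    """Build a simplified 'My Library' query (hypothetical helper).

    The showalways branch from the logged query is not reproduced here.
    """
    clauses = [
        "resourceType:sakai/pooled-content",
        "(manager:(%s) OR viewer:(%s))" % (user_id, user_id),
    ]
    term = term.strip()
    # An empty box or a bare "*" means "no text filter": add no clause.
    if term and term != "*":
        clauses.append("%s:(%s)" % (text_field, escape_term(term)))
    return " AND ".join(clauses)
```

With no search term this yields only the resourceType and manager/viewer 
clauses, so the seven wildcard clauses never reach Solr.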

3) "Explore / People" - Quality of results

https://jira.sakaiproject.org/browse/KERN-2806

Let's say I want to find someone named Ali. I type "Ali". "Alicia Keys" 
and "Adeeb Khalid" both appear above "Ali MacGraw".

Let's say I want to find someone whose name might be "Ollie" or "Oliver" 
or "Olivier," I don't quite remember. I type "ol", hit enter. First on 
my results list is "Eli Cochran".

Let's say I want to find a woman named "Di". I get 82 results, most of 
which have no visible "di" substring in their listings.

After figuring out the cause, the solution seems pretty clear. What 
bothers me more is that the problem slipped in so easily.

4) Explore People/Groups/Content efficiency and maintainability

https://jira.sakaiproject.org/browse/KERN-28074

It really is too bad that "name" and "firstName" and "lastName" and 
"email" and "title" and "tag" have to be specified separately in each 
people-query. And in fact they don't, because our Solr schema creates a 
field named "general" which nicely consolidates them. This query:

type:u AND resourceType:(authorizable OR profile) AND (general:ali OR 
edgengram:ali)

is equivalent to this query, except for being much faster and simpler:

type:u AND resourceType:(authorizable OR profile) AND (name:(ali) OR 
firstName:(ali) OR lastName:(ali) OR email:(ali) OR title:(ali) OR 
tag:(ali) OR edgengram:ali)

But since no one has reviewed all the search queries, and since Everyone 
Can't Know Everything, the "general" field isn't being used much.
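
This is the kind of knowledge a centralized query builder could hold so 
that callers never list fields themselves. A minimal Python sketch 
follows. The function name people_query is hypothetical, and rejecting 
anything other than letters and digits is a shortcut to keep the example 
short; a real implementation would escape the input instead.

```python
# Filter shared by every people search, taken from the queries above.
PEOPLE_FILTER = "type:u AND resourceType:(authorizable OR profile)"


def people_query(term):
    """Build a people search against the consolidated 'general' field."""
    tokens = term.split()
    # Shortcut for the sketch: refuse input that would need escaping.
    if not tokens or not all(t.isalnum() for t in tokens):
        raise ValueError("unsupported search term: %r" % term)
    joined = " ".join(tokens)
    return "%s AND (general:(%s) OR edgengram:(%s))" % (
        PEOPLE_FILTER, joined, joined)
```

If the schema changes, only this one function needs to change with it.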

Best,
Ray

[1] 
http://groups.google.com/group/sakai-kernel/browse_thread/thread/2b87415d291e60b1
