Perhaps I've been driven mad, but I wonder if we should start abstracting query details into centralized service locations. (For example, remove code in which client-side JavaScript and server-side servlets try to find related content, and replace it with a centralized "relatedContent" end-point.)
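
To make "centralized" a little more concrete, here is a minimal sketch, not a proposal for the actual OAE code. One function owns the Solr query string for people search, and callers hand it only the text the user typed. The function names are hypothetical, the field names are the ones from the people-search query in example 4 below, and I'm assuming that backslash-escaping the Lucene special characters is enough to keep user input from turning into wildcards.

```python
import re

# Characters and operators with special meaning in the Lucene/Solr query
# syntax. Assumption: backslash-escaping them is sufficient for this sketch.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(text: str) -> str:
    """Backslash-escape query syntax so user input cannot add wildcards."""
    return _SPECIAL.sub(r"\\\1", text.strip())


def people_query(user_text: str) -> str:
    """Hypothetical central builder for the people-search query string."""
    term = escape_term(user_text)
    if not term:
        raise ValueError("empty search text")
    # Parentheses keep multi-word input scoped to the named field.
    return (
        "type:u AND resourceType:(authorizable OR profile) "
        f"AND (general:({term}) OR edgengram:({term}))"
    )
```

With something like this, a client that sends "* 1234567 *" gets its asterisks escaped rather than run as wildcards, and the choice of fields lives in exactly one place where it can be reviewed.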

The OAE (like Sling/Jackrabbit) uses Solr/Lucene for the type of queries we typically do through SQL. That is certainly viable, but it may have encouraged some bad assumptions. Using Hibernate and an IDE to generate data tables directly from Java class definitions doesn't magically guarantee scalable, stable production code. Similarly, SQL-type queries in Solr/Lucene need the sort of focused attention that DBAs give SQL queries. As is, our codebase is full of queries which don't match UX expectations, or queries that are unnecessarily slow, or both.

Most of the slow queries I see could be prevented by following two basic principles:

* Avoid wildcards. Solr/Lucene is brilliant at Googlesque fuzzy searches, and wildcards are rarely needed for free-form text. For autocompletion of structured data, there are better solutions.

* Don't use redundant clauses. New Solr/Lucene search clauses aren't free. If a one-clause query produces the same results as a thirty-clause query, the one-clause query will be much faster. If you can find a matching field that doesn't appear much in the index, that will be faster than a multi-valued field associated with every document (e.g., "path").

We've known these principles since October 2010,[1] but our logs remain full of violations. Why? Nothing prevents the mistakes from being made. And the mistakes can be hard to catch in review because they combine code written by two different project teams in three different source repositories.

Here are four examples I've looked at in the last week:

1) Recently changed files and related content
https://jira.sakaiproject.org/browse/SAKIII-5322
https://jira.sakaiproject.org/browse/KERN-2685

This odd-looking item often shows up in our production logs:

*ERROR* GET /var/search/pool/all.infinity.json ...
Very slow solr query 1626 ms q=resourceType:sakai/pooled-content AND (content:(* 1234567 *) OR filename:(* 1234567 *) OR tag:(* 1234567 *) OR description:(* 1234567 * OR mimeType: * 1234567 *) OR ngram:(* 1234567 *) OR edgengram:(* 1234567 *) OR tag:(* 1234567 *)) ...

I count at least six mistakes. Syntactically, this wreck involves both client-side and server-side code. Semantically, it looks like a desperate stab by someone who didn't want to bother server-side developers for a more useful "related content" search. Meanwhile, more useful "find related content" logic actually does exist server-side, but never got matched to this particular client-side development task.

2) My Library
https://jira.sakaiproject.org/browse/KERN-2805

Another typical "very slow" query:

resourceType:sakai/pooled-content AND ((manager:(211159) OR viewer:(211159)) OR (showalways:true AND (...)) AND (content:(*) OR filename:(*) OR tag:(*) OR description:(*) OR path:(*) OR ngram:(*) OR edgengram:(*))

Redundant clauses of unnecessary wildcards! The only way it could be worse is if the wildcards were doubled or tripled up.

This will look familiar to my fellow old-timers.[1] But it's still easier for developers to make the mistake than not to make the mistake.

3) "Explore / People" - Quality of results
https://jira.sakaiproject.org/browse/KERN-2806

Let's say I want to find someone named Ali. I type "Ali". "Alicia Keys" and "Adeeb Khalid" both appear above "Ali MacGraw".

Let's say I want to find someone whose name might be "Ollie" or "Oliver" or "Olivier," I don't quite remember. I type "ol", hit enter. First on my results list is "Eli Cochran".

Let's say I want to find a woman named "Di". I get 82 results, most of which have no visible "di" substring in their listings.

After figuring out the cause, the solution seems pretty clear. What bothers me more is that the problem slipped in so easily.

4) Explore People/Groups/Content efficiency and maintainability
https://jira.sakaiproject.org/browse/KERN-28074

It really is too bad that "name" and "firstName" and "lastName" and "email" and "title" and "tag" have to be specified separately in each people-query. And in fact they don't, because our Solr schema creates a field named "general" which nicely consolidates them. This query:

type:u AND resourceType:(authorizable OR profile) AND (general:ali OR edgengram:ali)

is equivalent to this query, except for being much faster and simpler:

type:u AND resourceType:(authorizable OR profile) AND (name:(ali) OR firstName:(ali) OR lastName:(ali) OR email:(ali) OR title:(ali) OR tag:(ali) OR edgengram:ali)

But since no one has reviewed all the search queries, and since Everyone Can't Know Everything, the "general" field isn't being used much.

Best,
Ray

[1] http://groups.google.com/group/sakai-kernel/browse_thread/thread/2b87415d291e60b1

_______________________________________________
oae-dev mailing list
http://collab.sakaiproject.org/mailman/listinfo/oae-dev
