Looking ahead, these sound like really good and manageable issues/solutions to feed into the upcoming performance work.
Part of the measurement environment we're attempting to set up runs all of the search queries under load against a fully loaded system. Hopefully, having that in place will call these problem queries out much more quickly in the future.

Thanks,
Nicolaas

On 25 Apr 2012, at 17:05, Ray Davis wrote:

> Perhaps I've been driven mad, but I wonder if we should start
> abstracting query details into centralized service locations. (For
> example, remove code in which client-side Javascript and server-side
> servlets try to find related content, and replace it by a centralized
> "relatedContent" end-point.)
>
> The OAE (like Sling/Jackrabbit) uses Solr/Lucene for the type of
> queries we typically do through SQL. Which is certainly viable, but may
> have encouraged some bad assumptions. Using Hibernate and an IDE to
> generate data tables directly from Java class definitions doesn't
> magically guarantee scalable, stable production code. Similarly,
> SQL-type queries in Solr/Lucene need the sort of focused attention that
> DBAs give SQL queries.
>
> As is, our codebase is full of queries which don't match UX
> expectations, or queries that are unnecessarily slow, or both. Most of
> the slow queries I see could be prevented by following two basic
> principles:
>
> * Avoid wildcards. Solr/Lucene is brilliant at Googlesque fuzzy
>   searches, and wildcards are rarely needed for free-form text. For
>   autocompletion of structured data, there are better solutions.
>
> * Don't use redundant clauses. New Solr/Lucene search clauses aren't
>   free. If a one-clause query produces the same results as a
>   thirty-clause query, the single-clause query will be much faster. If
>   you can find a matching field that doesn't appear much in the index,
>   that will be faster than a multi-valued field associated with every
>   document (e.g., "path").
>
> We've known these principles since October 2010,[1] but our logs remain
> full of violations. Why?
> Nothing prevents the mistakes from being made. And the mistakes can be
> hard to catch in review because they combine code written by two
> different project teams in three different source repositories.
>
> Here are four examples I've looked at in the last week:
>
> 1) Recently changed files and related content
>
> https://jira.sakaiproject.org/browse/SAKIII-5322
> https://jira.sakaiproject.org/browse/KERN-2685
>
> This odd-looking item often shows up in our production logs:
>
> *ERROR* GET /var/search/pool/all.infinity.json ... Very slow solr query
> 1626 ms q=resourceType:sakai/pooled-content AND (content:(* 1234567 *)
> OR filename:(* 1234567 *) OR tag:(* 1234567 *) OR description:(* 1234567
> * OR mimeType: * 1234567 *) OR ngram:(* 1234567 *) OR edgengram:(*
> 1234567 *) OR tag:(* 1234567 *)) ...
>
> I count at least six mistakes. Syntactically, this wreck involves both
> client-side and server-side code. Semantically, it looks like a
> desperate stab by someone who didn't want to bother server-side
> developers for a more useful "related content" search. Meanwhile, more
> useful "find related content" logic actually does exist server-side,
> but never got matched to this particular client-side development task.
>
> 2) My Library
>
> https://jira.sakaiproject.org/browse/KERN-2805
>
> Another typical "very slow" query:
>
> resourceType:sakai/pooled-content AND ((manager:(211159) OR
> viewer:(211159)) OR (showalways:true AND (...)) AND (content:(*) OR
> filename:(*) OR tag:(*) OR description:(*) OR path:(*) OR ngram:(*) OR
> edgengram:(*))
>
> Redundant clauses of unnecessary wildcards! The only way it could be
> worse is if the wildcards were doubled or tripled up.
>
> This will look familiar to my fellow old-timers.[1] But it's still
> easier for developers to make the mistake than not to make the mistake.
>
> 3) "Explore / People" - Quality of results
>
> https://jira.sakaiproject.org/browse/KERN-2806
>
> Let's say I want to find someone named Ali. I type "Ali". "Alicia Keys"
> and "Adeeb Khalid" both appear above "Ali MacGraw".
>
> Let's say I want to find someone whose name might be "Ollie" or
> "Oliver" or "Olivier"; I don't quite remember. I type "ol", hit enter.
> First on my results list is "Eli Cochran".
>
> Let's say I want to find a woman named "Di". I get 82 results, most of
> which have no visible "di" substring in their listings.
>
> After figuring out the cause, the solution seems pretty clear. What
> bothers me more is that the problem slipped in so easily.
>
> 4) Explore People/Groups/Content efficiency and maintainability
>
> https://jira.sakaiproject.org/browse/KERN-28074
>
> It really is too bad that "name" and "firstName" and "lastName" and
> "email" and "title" and "tag" have to be specified separately in each
> people-query. And in fact they don't, because our Solr schema creates a
> field named "general" which nicely consolidates them. This query:
>
> type:u AND resourceType:(authorizable OR profile) AND (general:ali OR
> edgengram:ali)
>
> is equivalent to this query, except for being much faster and simpler:
>
> type:u AND resourceType:(authorizable OR profile) AND (name:(ali) OR
> firstName:(ali) OR lastName:(ali) OR email:(ali) OR title:(ali) OR
> tag:(ali) OR edgengram:ali)
>
> But since no one has reviewed all the search queries, and since
> Everyone Can't Know Everything, the "general" field isn't being used
> much.
>
> Best,
> Ray
>
> [1]
> http://groups.google.com/group/sakai-kernel/browse_thread/thread/2b87415d291e60b1
> _______________________________________________
> oae-dev mailing list
> [email protected]
> http://collab.sakaiproject.org/mailman/listinfo/oae-dev

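One way to keep such queries from slipping in again might be an automated check over logged or templated queries. The sketch below only illustrates the two principles Ray lists (avoid wildcards, avoid redundant clauses). `lint_query`, its `max_clauses` threshold, and the regex are all hypothetical and not part of OAE or Solr, and the regex is a rough heuristic rather than a real Lucene query parser. The `slow` string is a trimmed version of the My Library query quoted in example 2, and `fast` is the "general" field query from example 4.

```python
import re

# Heuristic only: a regex scan, not a real Lucene query parser.
# Matches field:value or field:(value ...) pairs.
FIELD_CLAUSE = re.compile(r"(\w+):\s*(\([^()]*\)|\S+)")


def lint_query(query, max_clauses=6):
    """Return warnings for wildcard terms, repeated fields, and clause count.

    max_clauses is an arbitrary illustrative threshold, not a Solr setting.
    """
    warnings = []
    clauses = FIELD_CLAUSE.findall(query)
    seen = set()
    for field, value in clauses:
        # Drop surrounding parentheses, then look at each whitespace-separated term.
        terms = value.strip("()").split()
        if any("*" in term or "?" in term for term in terms):
            warnings.append(f"wildcard in field '{field}'")
        if field in seen:
            warnings.append(f"field '{field}' appears more than once")
        seen.add(field)
    if len(clauses) > max_clauses:
        warnings.append(f"{len(clauses)} field clauses (limit {max_clauses})")
    return warnings


# Trimmed version of the "My Library" query from example 2.
slow = ("resourceType:sakai/pooled-content AND (content:(*) OR filename:(*) "
        "OR tag:(*) OR description:(*) OR path:(*) OR ngram:(*) OR edgengram:(*))")
# The "general" field query from example 4.
fast = ("type:u AND resourceType:(authorizable OR profile) "
        "AND (general:ali OR edgengram:ali)")

for warning in lint_query(slow):
    print(warning)
print(lint_query(fast))
```

Something along these lines could be pointed at the slow-query log from the load tests, though a real version would want a proper query parser and would need to handle escaped wildcard characters.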