Re: [oae-dev] Care and Feeding of Search Queries

Nicolaas Matthijs Wed, 25 Apr 2012 09:20:42 -0700

Looking ahead, these sound like really good and manageable
issues/solutions to feed into the upcoming performance work.


Part of the measurement environment we're attempting to set up runs
all of the search queries under load against a fully loaded system,
and hopefully having that in place will call these problem queries
out much more quickly in the future ...
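
As a rough sketch of what "calling these queries out" could look like:
the snippet below scans log lines for slow Solr queries. It assumes the
"Very slow solr query <n> ms q=..." format shown in the log excerpt in
Ray's message, with each entry on a single line. The function name and
the 1000 ms threshold are illustrative assumptions, not part of OAE.

```python
import re

# Assumed log format, based on the excerpt quoted in Ray's message:
#   ... Very slow solr query 1626 ms q=<query text>
SLOW_QUERY = re.compile(r"Very slow solr query\s+(\d+)\s*ms\s+q=(.*)")


def slow_queries(lines, threshold_ms=1000):
    """Return (elapsed_ms, query) pairs at or above the threshold, slowest first.

    Both the name and the default threshold are hypothetical.
    """
    hits = []
    for line in lines:
        match = SLOW_QUERY.search(line)
        if match is None:
            continue
        elapsed = int(match.group(1))
        if elapsed >= threshold_ms:
            hits.append((elapsed, match.group(2).strip()))
    return sorted(hits, reverse=True)
```

Feeding it the output of a load run would give a ranked list of
queries to review first.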

Thanks,
Nicolaas


On 25 Apr 2012, at 17:05, Ray Davis wrote:

> Perhaps I've been driven mad, but I wonder if we should start
> abstracting query details into centralized service locations. (For
> example, remove code in which client-side Javascript and server-side
> servlets try to find related content, and replace it by a centralized
> "relatedContent" end-point.)
>
> The OAE (like Sling/Jackrabbit) uses Solr/Lucene for the type of
> queries we typically do through SQL. Which is certainly viable, but
> may have encouraged some bad assumptions. Using Hibernate and an IDE
> to generate data tables directly from Java class definitions doesn't
> magically guarantee scalable, stable production code. Similarly,
> SQL-type queries in Solr/Lucene need the sort of focused attention
> that DBAs give SQL queries.
>
> As is, our codebase is full of queries which don't match UX
> expectations, or queries that are unnecessarily slow, or both. Most of
> the slow queries I see could be prevented by following two basic
> principles:
>
> * Avoid wildcards. Solr/Lucene is brilliant at Googlesque fuzzy
> searches, and wildcards are rarely needed for free-form text. For
> autocompletion of structured data, there are better solutions.
>
> * Don't use redundant clauses. New Solr/Lucene search clauses aren't
> free. If a one-clause query produces the same results as a
> thirty-clause query, the single-clause query will be much faster. If
> you can find a matching field that doesn't appear much in the index,
> that will be faster than a multi-valued field associated with every
> document (e.g., "path").
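
A minimal sketch of how these two principles might be checked
automatically, for instance in a test that runs over the query
templates. It is a crude text-level check, not a real Lucene parser,
so it will also flag escaped wildcards. The function name and the
clause limit are assumptions made up for illustration.

```python
import re

# Matches "fieldName:" prefixes such as "content:" or "resourceType:".
FIELD_PREFIX = re.compile(r"\b([A-Za-z_][\w.]*):")
# Matches any wildcard character; escaped wildcards are flagged too.
WILDCARD = re.compile(r"[*?]")


def lint_query(query, max_clauses=4):
    """Return warnings for a Solr/Lucene query string.

    Hypothetical helper; max_clauses is an arbitrary illustrative limit.
    """
    warnings = []
    fields = FIELD_PREFIX.findall(query)
    if WILDCARD.search(query):
        warnings.append("wildcard")
    if len(fields) > max_clauses:
        warnings.append("too many clauses: %d" % len(fields))
    repeated = sorted({name for name in fields if fields.count(name) > 1})
    if repeated:
        warnings.append("repeated fields: " + ", ".join(repeated))
    return warnings
```

Run against the slow queries quoted further down in this message, a
check like this would flag both the wildcards and the repeated "tag"
clause.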
>
> We've known these principles since October 2010,[1] but our logs
> remain full of violations. Why? Nothing prevents the mistakes from
> being made. And the mistakes can be hard to catch in review because
> they combine code written by two different project teams in three
> different source repositories.
>
> Here are four examples I've looked at in the last week:
>
> 1) Recently changed files and related content
>
> https://jira.sakaiproject.org/browse/SAKIII-5322
> https://jira.sakaiproject.org/browse/KERN-2685
>
> This odd-looking item often shows up in our production logs:
>
> *ERROR* GET /var/search/pool/all.infinity.json ... Very slow solr  
> query
> 1626 ms q=resourceType:sakai/pooled-content AND (content:(* 1234567 *)
> OR filename:(* 1234567 *) OR tag:(* 1234567 *) OR description:(*  
> 1234567
> * OR mimeType: * 1234567 *) OR ngram:(* 1234567 *) OR edgengram:(*
> 1234567 *) OR tag:(* 1234567 *)) ...
>
> I count at least six mistakes. Syntactically, this wreck involves both
> client-side and server-side code. Semantically, it looks like a
> desperate stab by someone who didn't want to bother server-side
> developers for a more useful "related content" search. Meanwhile, more
> useful "find related content" logic actually does exist server-side,
> but never got matched to this particular client-side development task.
>
> 2) My Library
>
> https://jira.sakaiproject.org/browse/KERN-2805
>
> Another typical "very slow" query:
>
> resourceType:sakai/pooled-content AND ((manager:(211159) OR
> viewer:(211159)) OR (showalways:true AND (...)) AND (content:(*) OR
> filename:(*) OR tag:(*) OR description:(*) OR path:(*) OR ngram:(*) OR
> edgengram:(*))
>
> Redundant clauses of unnecessary wildcards! The only way it could be
> worse is if the wildcards were doubled or tripled up.
>
> This will look familiar to my fellow old-timers.[1] But it's still
> easier for developers to make the mistake than not to make the
> mistake.
>
> 3) "Explore / People" - Quality of results
>
> https://jira.sakaiproject.org/browse/KERN-2806
>
> Let's say I want to find someone named Ali. I type "Ali". "Alicia
> Keys" and "Adeeb Khalid" both appear above "Ali MacGraw".
>
> Let's say I want to find someone whose name might be "Ollie" or
> "Oliver" or "Olivier"; I don't quite remember. I type "ol", hit enter.
> First on my results list is "Eli Cochran".
>
> Let's say I want to find a woman named "Di". I get 82 results, most of
> which have no visible "di" substring in their listings.
>
> After figuring out the cause, the solution seems pretty clear. What
> bothers me more is that the problem slipped in so easily.
>
> 4) Explore People/Groups/Content efficiency and maintainability
>
> https://jira.sakaiproject.org/browse/KERN-28074
>
> It really is too bad that "name" and "firstName" and "lastName" and
> "email" and "title" and "tag" have to be specified separately in each
> people-query. And in fact they don't, because our Solr schema creates
> a field named "general" which nicely consolidates them. This query:
>
> type:u AND resourceType:(authorizable OR profile) AND (general:ali OR
> edgengram:ali)
>
> is equivalent to this query, except for being much faster and simpler:
>
> type:u AND resourceType:(authorizable OR profile) AND (name:(ali) OR
> firstName:(ali) OR lastName:(ali) OR email:(ali) OR title:(ali) OR
> tag:(ali) OR edgengram:ali)
>
> But since no one has reviewed all the search queries, and since
> Everyone Can't Know Everything, the "general" field isn't being used
> much.
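
One way the "general" field might get used more is to build people
queries in a single place rather than by hand in each caller. The
sketch below is only an illustration of that idea: the function names
are hypothetical, and the escaping covers the characters that are
special to the Lucene query parser, plus whitespace.

```python
import re

# Characters with special meaning in Lucene query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def escape_term(term):
    """Backslash-escape query syntax so user input is treated as literal text."""
    return _SPECIAL.sub(r"\\\1", term)


def people_query(term):
    """Build the consolidated people query described above.

    Hypothetical helper: searches the "general" and "edgengram" fields
    instead of listing name, firstName, lastName, email, title and tag.
    """
    safe = escape_term(term.strip())
    return ("type:u AND resourceType:(authorizable OR profile) "
            "AND (general:%s OR edgengram:%s)" % (safe, safe))
```

With a helper like this, people_query("ali") yields the shorter of
the two queries shown above, and a stray "*" typed by a user is escaped
rather than passed through as a wildcard.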
>
> Best,
> Ray
>
> [1]
> http://groups.google.com/group/sakai-kernel/browse_thread/thread/2b87415d291e60b1
> _______________________________________________
> oae-dev mailing list
> [email protected]
> http://collab.sakaiproject.org/mailman/listinfo/oae-dev

