Re: [oae-dev] Care and Feeding of Search Queries

Ray Davis Thu, 26 Apr 2012 10:33:30 -0700

On 4/25/12 9:20 AM, Nicolaas Matthijs wrote:
> Looking ahead, these sound like really good and manageable
> issues/solutions to feed into the upcoming performance work.


Thanks, Nicolaas.

> Part of the measurement environment we're attempting to set up goes
> through all of the search queries under load using a fully loaded
> system, and hopefully having that in place will call these problem
> queries out much quicker in the future ...

Automated scalability and performance tests will track progress on known 
problems and catch known regressions. But they won't stop 
copying-and-pasting of iffy code and they won't notice poor 
interpretations of the UX design. (For example, if performance were the 
only issue with SAKIII-5322, today's pull request would have been a full 
fix. Unfortunately, that code is also functionally broken in a way that 
wasn't immediately obvious to casual testers.)

Our experience to date shows that structured queries (whether they're 
Solr or plain Lucene or SQL) need to receive focused attention from 
knowledgeable human beings. I don't see how that can happen when a query 
combines bits and pieces of code created by separate development teams 
at separate times.

Best,
Ray

>
> Thanks,
> Nicolaas
>
>
> On 25 Apr 2012, at 17:05, Ray Davis wrote:
>
>> Perhaps I've been driven mad, but I wonder if we should start
>> abstracting query details into centralized service locations. (For
>> example, remove code in which client-side JavaScript and server-side
>> servlets try to find related content, and replace it with a centralized
>> "relatedContent" end-point.)
>>
>> The OAE (like Sling/Jackrabbit) uses Solr/Lucene for the type of queries
>> we typically do through SQL. Which is certainly viable, but may have
>> encouraged some bad assumptions. Using Hibernate and an IDE to generate
>> data tables directly from Java class definitions doesn't magically
>> guarantee scalable, stable production code. Similarly, SQL-type queries
>> in Solr/Lucene need the sort of focused attention that DBAs give SQL
>> queries.
>>
>> As is, our codebase is full of queries which don't match UX
>> expectations, or queries that are unnecessarily slow, or both. Most of
>> the slow queries I see could be prevented by following two basic
>> principles:
>>
>> * Avoid wildcards. Solr/Lucene is brilliant at Googlesque fuzzy
>> searches, and wildcards are rarely needed for free-form text. For
>> autocompletion of structured data, there are better solutions.
>>
>> * Don't use redundant clauses. New Solr/Lucene search clauses aren't
>> free. If a one-clause query produces the same results as a thirty-clause
>> query, the single-clause query will be much faster. If you can find a
>> matching field that doesn't appear much in the index, that will be faster
>> than a multi-valued field associated with every document (e.g., "path").
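
A rough sketch of how the two principles above could be checked
mechanically, e.g. in a unit test or a log scanner. Everything here is
hypothetical: lint_query is not an existing OAE or Solr function, and the
max_clauses threshold of 5 is an arbitrary assumption, not a measured limit.

```python
import re

# Hypothetical lint helper; not part of OAE, Sling, or Solr.
WILDCARD = re.compile(r"(?<!\\)[*?]")          # unescaped * or ?
FIELD = re.compile(r"([A-Za-z_][\w.]*):")      # "fieldName:" prefixes


def lint_query(query, max_clauses=5):
    """Return a list of warnings for a Solr/Lucene query string."""
    warnings = []
    if WILDCARD.search(query):
        warnings.append("wildcard")
    fields = FIELD.findall(query)
    if len(fields) > max_clauses:
        warnings.append("too many clauses: %d" % len(fields))
    repeated = sorted({f for f in fields if fields.count(f) > 1})
    if repeated:
        warnings.append("repeated fields: " + ", ".join(repeated))
    return warnings


print(lint_query("type:u AND general:ali"))          # []
print(lint_query("content:(*) OR filename:(*)"))     # ['wildcard']
```

This is a crude text scan rather than a real query parser, so it will also
flag legitimate uses such as open-ended range queries like [* TO *]. It only
points a human reviewer at queries worth a closer look.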
>>
>> We've known these principles since October 2010,[1] but our logs remain
>> full of violations. Why? Nothing prevents the mistakes from being made.
>> And the mistakes can be hard to catch in review because they combine
>> code written by two different project teams in three different source
>> repositories.
>>
>> Here are four examples I've looked at in the last week:
>>
>> 1) Recently changed files and related content
>>
>> https://jira.sakaiproject.org/browse/SAKIII-5322
>> https://jira.sakaiproject.org/browse/KERN-2685
>>
>> This odd-looking item often shows up in our production logs:
>>
>> *ERROR* GET /var/search/pool/all.infinity.json ... Very slow solr query
>> 1626 ms q=resourceType:sakai/pooled-content AND (content:(* 1234567 *)
>> OR filename:(* 1234567 *) OR tag:(* 1234567 *) OR description:(* 1234567
>> * OR mimeType: * 1234567 *) OR ngram:(* 1234567 *) OR edgengram:(*
>> 1234567 *) OR tag:(* 1234567 *)) ...
>>
>> I count at least six mistakes. Syntactically, this wreck involves both
>> client-side and server-side code. Semantically, it looks like a
>> desperate stab by someone who didn't want to bother server-side
>> developers for a more useful "related content" search. Meanwhile, more
>> useful "find related content" logic actually does exist server-side, but
>> never got matched to this particular client-side development task.
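
One visible problem in the logged query is the "* 1234567 *" term itself.
Below is a minimal sketch of cleaning a raw search term before it goes into
a query, assuming a hypothetical clean_term helper that the client or
servlet would call. It is not the fix applied in SAKIII-5322 or KERN-2685,
and it addresses only the stray wildcards, not the other mistakes.

```python
import re

# Characters that have special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def clean_term(raw):
    """Drop stray wildcards from a raw term, then escape what is left.

    Hypothetical helper for illustration only.
    """
    # Remove standalone wildcard tokens and leading/trailing wildcards,
    # e.g. "* 1234567 *" becomes "1234567".
    tokens = [t.strip("*?") for t in raw.split()]
    term = " ".join(t for t in tokens if t)
    # Backslash-escape any remaining query-syntax characters.
    return _SPECIAL.sub(r"\\\1", term)


print(clean_term("* 1234567 *"))   # 1234567
```

An empty return value means the user supplied nothing searchable, in which
case the caller would presumably leave the text clause out altogether.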
>>
>> 2) My Library
>>
>> https://jira.sakaiproject.org/browse/KERN-2805
>>
>> Another typical "very slow" query:
>>
>> resourceType:sakai/pooled-content AND ((manager:(211159) OR
>> viewer:(211159)) OR (showalways:true AND (...)) AND (content:(*) OR
>> filename:(*) OR tag:(*) OR description:(*) OR path:(*) OR ngram:(*) OR
>> edgengram:(*))
>>
>> Redundant clauses of unnecessary wildcards! The only way it could be
>> worse is if the wildcards were doubled or tripled up.
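
A minimal sketch of the alternative: add a text clause only when the user
actually typed something. The function name and the choice of "content" as
the default text field are assumptions for illustration; the other field
names are copied from the logged query above. The showalways branch is left
out because its contents are elided in the log, and the sketch assumes the
match-all clause never narrowed the results in the first place.

```python
def library_query(user_id, text=None, text_field="content"):
    """Build a 'My Library'-style query without match-all wildcard clauses.

    Hypothetical builder; user_id and text are assumed to be escaped already.
    """
    clauses = [
        "resourceType:sakai/pooled-content",
        "(manager:(%s) OR viewer:(%s))" % (user_id, user_id),
    ]
    # A blank or "*" search means "everything": add no text clause at all.
    if text and text.strip("* "):
        clauses.append("%s:(%s)" % (text_field, text.strip()))
    return " AND ".join(clauses)


print(library_query("211159", "*"))
# resourceType:sakai/pooled-content AND (manager:(211159) OR viewer:(211159))
```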
>>
>> This will look familiar to my fellow old-timers.[1] But it's still
>> easier for developers to make the mistake than not to make the mistake.
>>
>> 3) "Explore / People" - Quality of results
>>
>> https://jira.sakaiproject.org/browse/KERN-2806
>>
>> Let's say I want to find someone named Ali. I type "Ali". "Alicia Keys"
>> and "Adeeb Khalid" both appear above "Ali MacGraw".
>>
>> Let's say I want to find someone whose name might be "Ollie" or "Oliver"
>> or "Olivier"; I don't quite remember. I type "ol", hit enter. First on
>> my results list is "Eli Cochran".
>>
>> Let's say I want to find a woman named "Di". I get 82 results, most of
>> which have no visible "di" substring in their listings.
>>
>> After figuring out the cause, the solution seems pretty clear. What
>> bothers me more is that the problem slipped in so easily.
>>
>> 4) Explore People/Groups/Content efficiency and maintainability
>>
>> https://jira.sakaiproject.org/browse/KERN-28074
>>
>> It really is too bad that "name" and "firstName" and "lastName" and
>> "email" and "title" and "tag" have to be specified separately in each
>> people-query. And in fact they don't, because our Solr schema creates a
>> field named "general" which nicely consolidates them. This query:
>>
>> type:u AND resourceType:(authorizable OR profile) AND (general:ali OR
>> edgengram:ali)
>>
>> is equivalent to this query, except for being much faster and simpler:
>>
>> type:u AND resourceType:(authorizable OR profile) AND (name:(ali) OR
>> firstName:(ali) OR lastName:(ali) OR email:(ali) OR title:(ali) OR
>> tag:(ali) OR edgengram:ali)
>>
>> But since no one has reviewed all the search queries, and since Everyone
>> Can't Know Everything, the "general" field isn't being used much.
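
To make the comparison concrete, here is a small sketch that builds both
forms of the people query shown above. The function names are hypothetical
and the term is assumed to be escaped already; the field names come
straight from the two queries in the message.

```python
PEOPLE_PREFIX = "type:u AND resourceType:(authorizable OR profile)"


def people_query(term):
    """Compact form: rely on the consolidated 'general' field."""
    return "%s AND (general:%s OR edgengram:%s)" % (PEOPLE_PREFIX, term, term)


def people_query_verbose(term):
    """Long form: one clause per source field, as in the slower query."""
    fields = ["name", "firstName", "lastName", "email", "title", "tag"]
    ors = " OR ".join("%s:(%s)" % (f, term) for f in fields)
    return "%s AND (%s OR edgengram:%s)" % (PEOPLE_PREFIX, ors, term)


print(people_query("ali"))
# type:u AND resourceType:(authorizable OR profile) AND (general:ali OR edgengram:ali)
```

Keeping a builder like this in one place would also mean that a later
schema change touches a single function instead of every copy of the query.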
>>
>> Best,
>> Ray
>>
>> [1]
>> http://groups.google.com/group/sakai-kernel/browse_thread/thread/2b87415d291e60b1
>>
>> _______________________________________________
>> oae-dev mailing list
>> [email protected]
>> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
>
>

