Re: [MarkLogic Dev General] Registered Query Best Practices

Ron Hitchens Wed, 31 Jul 2013 14:13:24 -0700

   I actually do have a bunch of queries wrapped in a cts:and-query,
not unlike Mike's example (among others).  In some cases these can be
collapsed down to multiple values in one query; in other cases they
cannot.  But as I said in my reply to Mike, the real issue is that the
cost of constructing a given query increases with the size of the
database and its indexes.
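
   To make the "multiple values in one query" case concrete, here is a
minimal sketch, not taken from the real code.  It assumes a string
element range index on a hypothetical element named entitlement-id, and
the ids are made up.  It only applies where the parts are alternatives:
a range query given a sequence of values matches any one of them, so it
stands in for a cts:or-query of single-value parts, not for a
cts:and-query of them.

```xquery
xquery version "1.0-ml";

(: Hypothetical element name; a string range index on it is assumed. :)
declare variable $ELEMENT := xs:QName("entitlement-id");

(: Illustrative values only. :)
declare variable $ids as xs:string* := ("a17", "b42", "c99");

(: One query part per value: the constructor runs once per id. :)
declare variable $many-parts :=
  cts:or-query(
    for $id in $ids
    return cts:element-range-query($ELEMENT, "=", $id));

(: The same constraint as a single part carrying all the values. :)
declare variable $one-part :=
  cts:element-range-query($ELEMENT, "=", $ids);

cts:search(fn:doc(), $one-part, "unfiltered")[1 to 5]
```

   Whether the single-part form is actually cheaper to construct on a
large database is something to confirm with prof:eval rather than take
on faith.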


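   The catch-and-re-register approach discussed further down the thread
could look roughly like the sketch below.  The function name and the
two external variables are hypothetical.  It assumes the saved
registration id and the stored, serialized query are both at hand, and
that a search against a missing registration raises XDMP-UNREGISTERED.
The expensive cts:query constructor then runs only on the failure path.

```xquery
xquery version "1.0-ml";

declare namespace error = "http://marklogic.com/xdmp/error";

(: Hypothetical inputs: the saved id and the stored serialized query. :)
declare variable $reg-id as xs:unsignedLong external;
declare variable $query-node as element() external;

(: Hypothetical helper, not from the original code. :)
declare function local:search-registered(
  $id as xs:unsignedLong,
  $serialized as element()
) as node()*
{
  try {
    (: Fast path: no cts:query construction at all.  xdmp:eager keeps
       a lazily evaluated search from failing outside the try block. :)
    xdmp:eager(
      cts:search(fn:doc(),
        cts:registered-query($id, "unfiltered"))[1 to 5])
  } catch ($e) {
    if ($e/error:code = "XDMP-UNREGISTERED") then
      (: Slow path: pay the construction cost once and re-register. :)
      let $new-id := cts:register(cts:query($serialized))
      return
        cts:search(fn:doc(),
          cts:registered-query($new-id, "unfiltered"))[1 to 5]
    else
      (: Anything else is not ours to handle. :)
      xdmp:rethrow()
  }
};

local:search-registered($reg-id, $query-node)
```

   If registering an identical query really does return the same id, as
assumed elsewhere in this thread, the saved id stays usable after the
slow path and nothing needs to be written back.
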
---
Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
     +44 7879 358 212 (voice)          http://www.ronsoft.com
     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
"No amount of belief establishes any fact." -Unknown


On Jul 31, 2013, at 8:11 PM, Geert Josten <[email protected]> wrote:

> The size and structure of the query can matter a lot. Michael's example
> shows 1000 query parts, but you'll see linear growth if you let it
> iterate up to 20000. On the other hand, if you pass those 20k random
> ids as one large sequence of allowed values into a single range query,
> the profile time drops to about 50 msec again. :)
> 
> @Ron, can you make the query smarter? Doing the same with fewer parts?
> 
> Kind regards,
> Geert
> 
>> -----Original message-----
>> From: [email protected] [mailto:general-
>> [email protected]] On behalf of Michael Blakeley
>> Sent: Wednesday, 31 July 2013 20:20
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>> 
>> If that profile is correct, I'd be much more worried about the cts:query
>> constructor. I have a hard time getting that much elapsed time out of a
>> cts:query constructor.
>> 
>> declare variable $q := cts:and-query(
>>  (1 to 1000) ! cts:element-range-query(
>>    xs:QName('does-not-exist'), '=',
>>    xdmp:integer-to-hex(xdmp:random()))) ;
>> 
>> prof:eval('
>>  declare variable $qnode as element() external ;
>>  cts:query($qnode)',
>>  (xs:QName('qnode'),
>>   document { $q }/*))
>> 
>> On my laptop, the profile shows cts:query at 99% of shallow, with
>> anywhere from 5-15 ms total. And that's with 1000 terms, which seems
>> like a lot. But I'm testing against a nearly empty database, which
>> might make a difference.
>> 
>> Is cts:query still a hotspot if you drop the registered-query code
>> entirely?
>> 
>> Is there a particular cts:query term type that triggers this?
>> 
>> Does xdmp:query-meters() show anything indicating database lookups?
>> 
>> -- Mike
>> 
>> On 31 Jul 2013, at 09:37 , Ron Hitchens <[email protected]> wrote:
>> 
>>> 
>>>  So here's a little more color on this, if anyone is still
>>> interested.  When I profile this code, where $query is a fairly
>>> complex serialized query that was previously computed and stored
>>> in a database:
>>> 
>>> declare variable $q1 := cts:registered-query (
>>>   cts:register (cts:query ($query)), "unfiltered");
>>> 
>>> cts:search (fn:doc(), $q1)[1 to 5]
>>> 
>>> The top two items on the profile output are:
>>> 
>>> Shallow%  Shallow usecs   Deep%  Deep usecs  Expression
>>> 80        125000          90     140000      cts:query($query)
>>> 10         16000         100     156000      cts:registered-query (cts:register (cts:query ($query)), "unfiltered")
>>> 
>>>  Time spent on the actual search is so small it rounded to zero.
>>> 
>>>  Doing this repeatedly yields similar timing, so it's not a cold
>>> cache situation or anything like that.
>>> 
>>>  Profiling this:
>>> 
>>> declare variable $q2 := cts:registered-query (9156609332438599120,
>>>   "unfiltered");
>>> 
>>> cts:search (fn:doc(), $q2)[1 to 5]
>>> 
>>>  Yields times too fast to measure (all rounded to zero).
>>> 
>>>  So, the potentially expensive-to-create query is being
>>> built every time and possibly being re-registered as well,
>>> given that cts:registered-query is taking a non-trivial amount
>>> of time.
>>> 
>>> On Jul 31, 2013, at 8:38 AM, Ron Hitchens <[email protected]> wrote:
>>> 
>>>> 
>>>> The overall entitlement query on each request is composed
>>>> of many sub-queries, some of which are static and registered,
>>>> some of which are dependent on the current time.  But even the
>>>> static ones are not a finite set: new ones can be created at any
>>>> time as part of a new entitlement definition.
>>>> 
>>>> I'm working on a scheme to catch and re-register all the
>>>> static queries in a given query tree when a search fails due
>>>> to a missing registration.  That should lazily re-register
>>>> on first use after a server restart as well.
>>>> 
>>>> 
>>>> 
>>>> On Jul 30, 2013, at 8:30 PM, Geert Josten <[email protected]> wrote:
>>>> 
>>>>> Hi Ron,
>>>>> 
>>>>> Are your queries such that you would have a finite number of
>>>>> sub-queries if you broke them into smaller subparts? Perhaps you
>>>>> can combine multiple registered queries.
>>>>> 
>>>>> Cheers,
>>>>> Geert
>>>>> 
>>>>>> -----Original message-----
>>>>>> From: [email protected] [mailto:general-
>>>>>> [email protected]] On behalf of Ron Hitchens
>>>>>> Sent: Tuesday, 30 July 2013 2:29
>>>>>> To: MarkLogic Developer Discussion
>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>> 
>>>>>> 
>>>>>> Hi Geert,
>>>>>> 
>>>>>> I've done something before where we stored reg ids in a map for
>>>>>> easy re-use.  In that case, there was a 1:1 correspondence between
>>>>>> the reg id and a meaningful business domain number.  On this
>>>>>> project that's not the case.
>>>>>> 
>>>>>> Also, there is not a finite set of queries that need to be
>>>>>> registered, so it's not feasible to pre-register everything once.
>>>>>> New ones can be created dynamically.  And the complicated queries
>>>>>> are persisted in another database and can be referenced later.
>>>>>> This means the queries which should be registered will persist
>>>>>> across server restarts.  Which means there must be a way to
>>>>>> register the queries on first use, then make use of those
>>>>>> registered queries on subsequent requests.
>>>>>> 
>>>>>> The re-register-before-each-use pattern solves that nicely, but
>>>>>> not if the query construction cost must be re-paid each time.  It
>>>>>> looks like the robust solution is going to have to be catching
>>>>>> exceptions for unregistered queries and reconstructing the
>>>>>> registrations.  It's a shame because that is going to add
>>>>>> unnecessary complexity to the code.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jul 29, 2013, at 8:15 PM, Geert Josten <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Ron,
>>>>>>> 
>>>>>>> I recently saw a strategy where they deliberately took a different
>>>>>>> approach. In their case the calculation of the queries was not
>>>>>>> straightforward and could run into 30k search terms. Additionally,
>>>>>>> registering the query, and warming up the cache by doing one
>>>>>>> initial search after registering each query, took the most time.
>>>>>>> They were searching roughly 40 million docs. The searches
>>>>>>> themselves were sub-second.
>>>>>>> 
>>>>>>> Their approach was to store all registered query ids somewhere and
>>>>>>> have them readily available at actual search time. They also used
>>>>>>> a try/catch to catch unregistered queries, though in their case
>>>>>>> these shouldn't actually occur, and they dramatically pulled down
>>>>>>> the average on performance tests.
>>>>>>> 
>>>>>>> How much chance is there that a query is unregistered, if you
>>>>>>> prepared all queries beforehand?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Geert
>>>>>>> 
>>>>>>>> -----Original message-----
>>>>>>>> From: [email protected] [mailto:general-
>>>>>>>> [email protected]] On behalf of Michael Blakeley
>>>>>>>> Sent: Monday, 29 July 2013 21:08
>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>> 
>>>>>>>> I think you're using registered query as intended. That behavior
>>>>>>>> sounds odd to me. I would expect (2) to be cheap, just a hash
>>>>>>>> operation on the query terms, and I would expect (3) to be the
>>>>>>>> expensive step.
>>>>>>>> 
>>>>>>>> So I would contact support and see what they think.
>>>>>>>> 
>>>>>>>> -- Mike
>>>>>>>> 
>>>>>>>> On 29 Jul 2013, at 11:03 , Ron Hitchens <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> What is the best practice these days for using registered
>>>>>>>>> queries?  I was under the impression that the pattern should be:
>>>>>>>>> 
>>>>>>>>> 1) Create your query:
>>>>>>>>> $query := cts:and-query ((blah blah blah))
>>>>>>>>> 2) Register it and make a registered query from it in one step:
>>>>>>>>> $reg-query := cts:registered-query (cts:register ($query), "unfiltered")
>>>>>>>>> 3) Use it in a search:
>>>>>>>>> cts:search (fn:doc(), $reg-query)
>>>>>>>>> 
>>>>>>>>> The theory being that if the cts:query described by $query is
>>>>>>>>> already registered, then the registration is essentially a no-op
>>>>>>>>> and you'll get back the same ID.  And doing this every time
>>>>>>>>> ensures that if the registered query has been evicted for some
>>>>>>>>> reason then it's re-registered and all is well.
>>>>>>>>> 
>>>>>>>>> It's a nice theory but seems to be based on the assumption that
>>>>>>>>> creating a cts:query object is very cheap.  Unfortunately, I'm
>>>>>>>>> finding that this is often not the case, especially when there
>>>>>>>>> are lots of documents in the database.  I have a test case where
>>>>>>>>> performing Step 2 above on a moderately complicated query takes
>>>>>>>>> roughly 200ms every time.  Others take even longer and all seem
>>>>>>>>> to be proportional to database size.  But running Step 3 with
>>>>>>>>> cts:registered-query(<regid>) is very, very fast (~0ms).
>>>>>>>>> Re-creating the query for re-registering every time is
>>>>>>>>> destroying the benefit of using a registered query.
>>>>>>>>> 
>>>>>>>>> I can obviously save the registration ID obtained from calling
>>>>>>>>> cts:register and then make a cts:registered-query each time, but
>>>>>>>>> then I'm not protected from the query becoming unregistered.
>>>>>>>>> And there is no lightweight way to test if an ID is still
>>>>>>>>> registered.  The only way I know to make this robust is to put
>>>>>>>>> a loop and try/catch around the code that does the search.  But
>>>>>>>>> that requires passing along enough context to re-construct and
>>>>>>>>> re-register the queries (there can be dozens of them in this
>>>>>>>>> case).  This is obviously a lot harder than building the
>>>>>>>>> complex query in one module and then passing it along to the
>>>>>>>>> search code somewhere else.
>>>>>>>>> 
>>>>>>>>> What's the generally accepted best usage pattern for registered
>>>>>>>>> queries?  And is it my imagination or has the cost of running
>>>>>>>>> queries been moving from query evaluation into query
>>>>>>>>> construction?
>>>>>>>>> 
>>>>>>>>> Thanks.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> General mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>>>>>>> 
>>>>>>>> 
