Re: [MarkLogic Dev General] Registered Query Best Practices

Ron Hitchens Wed, 31 Jul 2013 15:22:35 -0700

   I didn't set up these machines (AWS images) but I think they
have the recommended number of forests for the number of cores.
There are around 6 million documents loaded, and they should be
evenly distributed across the forests.


   But I want to remain focused on the fact that this idiom:

cts:registered-query (cts:register ($query))

   Is actually counter-productive if "cts:register ($query)"
is not always cheap when $query has already been registered.
I expect to pay any query creation cost the first time, when
$query is not already registered.  But not every time.

   Forest size or layout should be irrelevant here.  For this
idiom to work it must be cheap for cts:register to check for a
prior registration of $query and quickly return the existing
registration id.  It appears that $query is being reified on
EVERY call to cts:register, not just when it needs to actually
be registered.

   Has something broken here?  Are the queries being reified
now when they weren't before?  Is this a regression or has it
always worked like this?
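
   A minimal sketch of the fallback discussed earlier in the thread
(search with a saved registration id, and rebuild the query only when
that fails) might look like the following.  The function name
local:search-registered and its parameters are hypothetical, and it
assumes the failure surfaces as an XDMP-UNREGISTERED error.  Because
cts:search is lazy, xdmp:eager is used so the search is evaluated
inside the try block.

```xquery
xquery version "1.0-ml";

declare namespace error = "http://marklogic.com/xdmp/error";

(: Hypothetical helper, not from the thread: try the saved id first and
   pay the query construction cost only if the id is not registered. :)
declare function local:search-registered(
  $reg-id as xs:unsignedLong,
  $serialized as element()   (: the stored, serialized cts:query :)
) as document-node()*
{
  try {
    (: force evaluation here so an error is raised inside the try :)
    xdmp:eager(
      cts:search(fn:doc(),
        cts:registered-query($reg-id, "unfiltered"))[1 to 5])
  } catch ($e) {
    if ($e/error:code eq "XDMP-UNREGISTERED") then
      (: slow path: rebuild the query and register it again :)
      let $new-id := cts:register(cts:query($serialized))
      return
        cts:search(fn:doc(),
          cts:registered-query($new-id, "unfiltered"))[1 to 5]
    else
      xdmp:rethrow()
  }
};
```

   The trade-off is the one described in the quoted messages: the
common case skips cts:query and cts:register entirely, at the price
of passing the serialized query along to the search code.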

---
Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
     +44 7879 358 212 (voice)          http://www.ronsoft.com
     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
"No amount of belief establishes any fact." -Unknown

On Jul 31, 2013, at 10:21 PM, Danny Sokolsky <[email protected]> 
wrote:

> OK, if it is range queries that are at play here, then it might be 
> interesting to look at how big your forests are.  It is possible that adding 
> more forests might increase your parallelism and make each forest's part of 
> the index resolution smaller.  This is especially true with Range Index 
> operations where there is a lot of data in each forest, because the range 
> index files are memory mapped.
> 
> How many documents are in each forest, and how many forests do you have?
> 
> -Danny
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Ron Hitchens
> Sent: Wednesday, July 31, 2013 2:13 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
> 
> 
>   I actually do have a bunch of queries wrapped in a cts:and-query, not 
> unlike Mike's example (among others).  In some cases these can be collapsed 
> down to multiple values in one query, in other cases not.
> But as I said in my reply to Mike, the real issue is that the cost of 
> constructing a given query increases with the size of the database/indexes.
> 
> ---
> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>     +44 7879 358 212 (voice)          http://www.ronsoft.com
>     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
> 
> 
> On Jul 31, 2013, at 8:11 PM, Geert Josten <[email protected]> wrote:
> 
>> The size and structure of the query can matter a lot. Michael's 
>> example shows 1000 query parts, but you'll see linear growth if you 
>> let it iterate up to 20000. On the other hand, if you pass in those 
>> 20k random IDs as one large sequence of allowed values into one range 
>> query, the profile time drops to about 50 msec again. :)
>> 
>> @Ron, can you make the query smarter? Doing the same with fewer parts?
>> 
>> Kind regards,
>> Geert
>> 
>>> -----Original Message-----
>>> From: [email protected] [mailto:general- 
>>> [email protected]] On Behalf Of Michael Blakeley
>>> Sent: Wednesday, 31 July 2013 20:20
>>> To: MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>> 
>>> If that profile is correct, I'd be much more worried about the 
>>> cts:query constructor. I have a hard time getting that much elapsed 
>>> time out of a cts:query constructor.
>>> 
>>> declare variable $q := cts:and-query(
>>> (1 to 1000) ! cts:element-range-query(
>>>   xs:QName('does-not-exist'), '=',
>>>   xdmp:integer-to-hex(xdmp:random()))) ;
>>> 
>>> prof:eval('
>>> declare variable $qnode as element() external ;  cts:query($qnode)',  
>>> (xs:QName('qnode'),
>>>  document { $q }/*))
>>> 
>>> On my laptop, the profile shows cts:query at 99% of shallow, with
>>> anywhere from 5-15 ms total. And that's with 1000 terms, which seems
>>> like a lot. But I'm testing against a nearly empty database, which
>>> might make a difference.
>>> 
>>> Is cts:query still a hotspot if you drop the registered-query code
>>> entirely?
>>> 
>>> Is there a particular cts:query term type that triggers this?
>>> 
>>> Does xdmp:query-meters() show anything indicating database lookups?
>>> 
>>> -- Mike
>>> 
>>> On 31 Jul 2013, at 09:37 , Ron Hitchens <[email protected]> wrote:
>>> 
>>>> 
>>>> So here's a little more color on this, if anyone is still 
>>>> interested.  When I profile this code, where $query is a fairly 
>>>> complex serialized query that was previously computed and stored in 
>>>> a database:
>>>> 
>>>> declare variable $q1 := cts:registered-query (
>>>>   cts:register (cts:query ($query)), "unfiltered");
>>>> 
>>>> cts:search (fn:doc(), $q1)[1 to 5]
>>>> 
>>>> The top two items on the profile output are:
>>>> 
>>>> Shallow%  Shallow usecs   Deep%  Deep usecs  Expression
>>>> 80        125000          90     140000      cts:query($query)
>>>> 10         16000         100     156000      cts:registered-query (cts:register (cts:query ($query)), "unfiltered")
>>>> 
>>>> Time spent on the actual search is so small it rounded to zero.
>>>> 
>>>> Doing this repeatedly yields similar timing, so it's not a cold 
>>>> cache situation or anything like that.
>>>> 
>>>> Profiling this:
>>>> 
>>>> declare variable $q2 := cts:registered-query (9156609332438599120, "unfiltered");
>>>> 
>>>> cts:search (fn:doc(), $q2)[1 to 5]
>>>> 
>>>> Yields times too fast to measure (all rounded to zero)
>>>> 
>>>> So, the potentially expensive to create query is being built every 
>>>> time and possibly being re-registered as well, given that 
>>>> cts:registered-query is taking a non-trivial amount of time.
>>>> 
>>>> On Jul 31, 2013, at 8:38 AM, Ron Hitchens <[email protected]> wrote:
>>>> 
>>>>> 
>>>>> The overall entitlement query on each request is composed of many 
>>>>> sub-queries, some of which are static and registered, some of which 
>>>>> are dependent on the current time.  But even the static ones are 
>>>>> not finite, new ones can be created at any time as part of a new 
>>>>> entitlement definition.
>>>>> 
>>>>> I'm working on a scheme to catch and re-register all the static 
>>>>> queries in a given query tree when a search fails due to a missing 
>>>>> registration.  That should lazily re-register on first use after a 
>>>>> server restart as well.
>>>>> 
>>>>> ---
>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>  +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>  +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>> "No amount of belief establishes any fact." -Unknown
>>>>> 
>>>>> 
>>>>> On Jul 30, 2013, at 8:30 PM, Geert Josten <[email protected]> wrote:
>>>>> 
>>>>>> Hi Ron,
>>>>>> 
>>>>>> Are your queries such that you would have a finite number of
>>>>>> sub-queries if you broke them into smaller subparts? Perhaps you
>>>>>> can combine multiple registered queries.
>>>>>> 
>>>>>> Cheers,
>>>>>> Geert
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: [email protected] [mailto:general- 
>>>>>>> [email protected]] On Behalf Of Ron Hitchens
>>>>>>> Sent: Tuesday, 30 July 2013 2:29
>>>>>>> To: MarkLogic Developer Discussion
>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Geert,
>>>>>>> 
>>>>>>> I've done something before where we stored reg ids in a map for
>>>>>>> easy re-use.  In that case, there was a 1:1 correspondence
>>>>>>> between the reg id and a meaningful business domain number.  On
>>>>>>> this project that's not the case.
>>>>>>> 
>>>>>>> Also, there is not a finite set of queries that need to be
>>>>>>> registered, so it's not feasible to pre-register everything once.
>>>>>>> New ones can be created dynamically.  And the complicated queries
>>>>>>> are persisted in another database and can be referenced later.
>>>>>>> This means the queries which should be registered will persist
>>>>>>> across server restarts.  Which means there must be a way to
>>>>>>> register the queries on first use, then make use of those
>>>>>>> registered queries on subsequent requests.
>>>>>>> 
>>>>>>> The re-register-before-each-use pattern solves that nicely, but
>>>>>>> not if the query construction cost must be re-paid each time.  It
>>>>>>> looks like the robust solution is going to have to be catching
>>>>>>> exceptions for unregistered queries and reconstructing the
>>>>>>> registrations.  It's a shame because that is going to add
>>>>>>> unnecessary complexity to the code.
>>>>>>> 
>>>>>>> ---
>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>> 
>>>>>>> 
>>>>>>> On Jul 29, 2013, at 8:15 PM, Geert Josten <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hi Ron,
>>>>>>>> 
>>>>>>>> I recently saw a strategy where they deliberately took a
>>>>>>>> different approach. In their case the calculation of the queries
>>>>>>>> was not straightforward and could run into 30k search terms.
>>>>>>>> Additionally, registering the query and warming up the cache by
>>>>>>>> doing one initial search after registering each query took the
>>>>>>>> most time. They were searching roughly 40 million docs. The
>>>>>>>> searches themselves were sub-second.
>>>>>>>> 
>>>>>>>> Their approach was to store all registered query IDs somewhere
>>>>>>>> and have them readily available at actual search time. They also
>>>>>>>> used a try/catch to catch unregistered queries, though in their
>>>>>>>> case these shouldn't actually occur, and they dramatically pulled
>>>>>>>> down the average on performance tests.
>>>>>>>> 
>>>>>>>> How much chance is there that a query is unregistered, if you
>>>>>>>> prepare all queries beforehand?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Geert
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [email protected] [mailto:general- 
>>>>>>>>> [email protected]] On Behalf Of Michael Blakeley
>>>>>>>>> Sent: Monday, 29 July 2013 21:08
>>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>>> 
>>>>>>>>> I think you're using registered query as intended. That behavior
>>>>>>>>> sounds odd to me. I would expect (2) to be cheap, just a hash
>>>>>>>>> operation on the query terms, and I would expect (3) to be the
>>>>>>>>> expensive step.
>>>>>>>>> 
>>>>>>>>> So I would contact support and see what they think.
>>>>>>>>> 
>>>>>>>>> -- Mike
>>>>>>>>> 
>>>>>>>>> On 29 Jul 2013, at 11:03 , Ron Hitchens <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> What is the best practice these days for using registered 
>>>>>>>>>> queries?  I was under the impression that the pattern should be:
>>>>>>>>>> 
>>>>>>>>>> 1) Create your query:
>>>>>>>>>> $query := cts:and-query ((blah blah blah))
>>>>>>>>>> 2) Register it and make a registered query from it in one step:
>>>>>>>>>> $reg-query := cts:registered-query (cts:register ($query), "unfiltered")
>>>>>>>>>> 3) Use it in a search:
>>>>>>>>>> cts:search (fn:doc(), $reg-query)
>>>>>>>>>> 
>>>>>>>>>> The theory being that if the cts:query described by $query is
>>>>>>>>>> already registered, then the registration is essentially a
>>>>>>>>>> no-op and you'll get back the same ID.  And doing this every
>>>>>>>>>> time ensures that if the registered query has been evicted for
>>>>>>>>>> some reason then it's re-registered and all is well.
>>>>>>>>>> 
>>>>>>>>>> It's a nice theory but seems to be based on the assumption
>>>>>>>>>> that creating a cts:query object is very cheap.
>>>>>>>>>> Unfortunately, I'm finding that this is often not the case,
>>>>>>>>>> especially when there are lots of documents in the database.
>>>>>>>>>> I have a test case where performing Step 2 above on a
>>>>>>>>>> moderately complicated query takes roughly 200ms every time.
>>>>>>>>>> Others take even longer and all seem to be proportional to
>>>>>>>>>> database size.  But running Step 3 with
>>>>>>>>>> cts:registered-query(<regid>) is very, very fast (~0ms).
>>>>>>>>>> Re-creating the query for re-registering every time is
>>>>>>>>>> destroying the benefit of using a registered query.
>>>>>>>>>> 
>>>>>>>>>> I can obviously save the registration ID obtained from calling
>>>>>>>>>> cts:register and then make a cts:registered-query each time,
>>>>>>>>>> but then I'm not protected from the query becoming
>>>>>>>>>> unregistered.  And there is no lightweight way to test if an
>>>>>>>>>> ID is still registered.  The only way I know to make this
>>>>>>>>>> robust is to put a loop and try/catch around the code that
>>>>>>>>>> does the search.  But that requires passing along enough
>>>>>>>>>> context to re-construct and re-register the queries (there
>>>>>>>>>> can be dozens of them in this case).  This is obviously a lot
>>>>>>>>>> harder than building the complex query in one module and then
>>>>>>>>>> passing it along to the search code somewhere else.
>>>>>>>>>> 
>>>>>>>>>> What's the generally accepted best usage pattern for
>>>>>>>>>> registered queries?  And is it my imagination or has the cost
>>>>>>>>>> of running queries been moving from query evaluation into
>>>>>>>>>> query construction?
>>>>>>>>>> 
>>>>>>>>>> Thanks.
>>>>>>>>>> 
>>>>>>>>>> ---
>>>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> General mailing list
>>>>>>>>>> [email protected] 
>>>>>>>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>>>>>>>> 
>>>>>>>>> 
