Re: [MarkLogic Dev General] Registered Query Best Practices

Ron Hitchens Wed, 31 Jul 2013 16:06:15 -0700

   Someone from MarkLogic is already engaged onsite to look at this
and other issues, so it will be sent through the appropriate channels.


   My reason for hashing it out here is to discover whether others out
there have been using registered queries in this way and whether anyone
has encountered similar problems.  I've run across similar situations on
two projects now, and I want to find out if anyone else has run into it
as well.
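
   For reference, the fallback discussed further down the thread (keep
the registration ID, catch the failure when the registration has gone
missing, then re-register and retry) might look roughly like the sketch
below.  It is only a sketch under assumptions: local:reg-id, local:search,
the key string and the use of a server field as the ID cache are
hypothetical choices for illustration, and xdmp:eager is assumed to be
available in the server version in use.  XDMP-UNREGISTERED is the error
code MarkLogic documents for a search that uses an unknown registration ID.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: return the cached registration ID for $key, or
   build and register the query when nothing is cached or $force is set.
   $query-node is the stored, serialized query. :)
declare function local:reg-id(
  $key as xs:string,
  $query-node as element(),
  $force as xs:boolean
) as xs:unsignedLong
{
  let $cached := if ($force) then () else xdmp:get-server-field($key)
  return
    if (fn:exists($cached)) then $cached
    else xdmp:set-server-field($key, cts:register(cts:query($query-node)))
};

(: Hypothetical helper: search with the cached ID and pay the query
   construction and registration cost only when the ID turns out to be gone. :)
declare function local:search(
  $key as xs:string,
  $query-node as element()
) as document-node()*
{
  try {
    (: force evaluation here so a failure surfaces inside the try block :)
    xdmp:eager(
      cts:search(fn:doc(),
        cts:registered-query(
          local:reg-id($key, $query-node, fn:false()), "unfiltered"))[1 to 5])
  } catch ($e) {
    if ($e/error:code eq "XDMP-UNREGISTERED")
    then
      (: registration was lost: rebuild, re-register, retry once :)
      cts:search(fn:doc(),
        cts:registered-query(
          local:reg-id($key, $query-node, fn:true()), "unfiltered"))[1 to 5]
    else xdmp:rethrow()
  }
};

(: Example call; the word query is only a stand-in for a stored query. :)
local:search("example-key", document { cts:word-query("example") }/*)
```

   Server fields are local to each host, so in a cluster every host would
pay the registration cost once on its own first use.  The point of the
design is that cts:query() runs only on a cache miss or after a lost
registration, not on every request.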

---
Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
     +44 7879 358 212 (voice)          http://www.ronsoft.com
     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
"No amount of belief establishes any fact." -Unknown


On Jul 31, 2013, at 11:27 PM, Michael Blakeley <[email protected]> wrote:

> It would be fun to speculate, but I think it would be more productive to
> contact support with those questions.
> 
> -- Mike
> 
> On 31 Jul 2013, at 15:22 , Ron Hitchens <[email protected]> wrote:
> 
>> 
>>  I didn't set up these machines (AWS images) but I think they
>> have the recommended number of forests for the number of cores.
>> There are around 6 million documents loaded and they should be
>> evenly distributed across the forests.
>> 
>>  But I want to remain focused on the fact that this idiom:
>> 
>> cts:registered-query (cts:register ($query))
>> 
>>  Is actually counter-productive if "cts:register ($query)"
>> is not always cheap when $query has already been registered.
>> I expect to pay any query creation cost the first time, when
>> $query is not already registered.  But not every time.
>> 
>>  Forest size or layout should be irrelevant here.  For this
>> idiom to work it must be cheap for cts:register to check for a
>> prior registration of $query and quickly return the existing
>> registration id.  It appears that $query is being reified on
>> EVERY call to cts:register, not just when it needs to actually
>> be registered.
>> 
>>  Has something broken here?  Are the queries being reified
>> now when they weren't before?  Is this a regression or has it
>> always worked like this?
>> 
>> ---
>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>    +44 7879 358 212 (voice)          http://www.ronsoft.com
>>    +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>> "No amount of belief establishes any fact." -Unknown
>> 
>> On Jul 31, 2013, at 10:21 PM, Danny Sokolsky <[email protected]> 
>> wrote:
>> 
>>> OK, if it is range queries that are at play here, then it might be
>>> interesting to look at how big your forests are.  It is possible that
>>> adding more forests might increase your parallelism and make each forest's 
>>> part of the index resolution smaller.  This is especially true with Range 
>>> Index operations where there is a lot of data in each forest, because the 
>>> range index files are memory mapped.
>>> 
>>> How many documents are in each forest, and how many forests do you have?
>>> 
>>> -Danny
>>> 
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Ron Hitchens
>>> Sent: Wednesday, July 31, 2013 2:13 PM
>>> To: MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>> 
>>> 
>>> I actually do have a bunch of queries wrapped in a cts:and-query, not
>>> unlike Mike's example (among others).  In some cases these can be collapsed
>>> down to multiple values in one query, in other cases not.
>>> But as I said in my reply to Mike, the real issue is that the cost of 
>>> constructing a given query increases with the size of the database/indexes.
>>> 
>>> ---
>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>   +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>   +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>> "No amount of belief establishes any fact." -Unknown
>>> 
>>> 
>>> On Jul 31, 2013, at 8:11 PM, Geert Josten <[email protected]> wrote:
>>> 
>>>> The size and structure of the query can matter a lot. Michael's
>>>> example shows 1000 query parts, but you'll see linear growth if you
>>>> let it iterate up to 20000. On the other hand, if you pass in those
>>>> 20k random IDs as one large sequence of allowed values into one range
>>>> query, the profile time drops to about 50 msec again. :)
>>>>
>>>> @Ron, can you make the query smarter? Doing the same with fewer parts?
>>>> 
>>>> Kind regards,
>>>> Geert
>>>> 
>>>>> -----Original Message-----
>>>>> From: [email protected]
>>>>> [mailto:general-[email protected]] On Behalf Of Michael Blakeley
>>>>> Sent: Wednesday, 31 July 2013 20:20
>>>>> To: MarkLogic Developer Discussion
>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>> 
>>>>> If that profile is correct, I'd be much more worried about the 
>>>>> cts:query constructor. I have a hard time getting that much elapsed 
>>>>> time out of a cts:query constructor.
>>>>> 
>>>>> declare variable $q := cts:and-query(
>>>>>   (1 to 1000) ! cts:element-range-query(
>>>>>     xs:QName('does-not-exist'), '=',
>>>>>     xdmp:integer-to-hex(xdmp:random())));
>>>>>
>>>>> prof:eval('
>>>>>   declare variable $qnode as element() external;
>>>>>   cts:query($qnode)',
>>>>>   (xs:QName('qnode'), document { $q }/*))
>>>>> 
>>>>> On my laptop, the profile shows cts:query at 99% of shallow, with
>>>>> anywhere from 5-15 ms total. And that's with 1000 terms, which seems
>>>>> like a lot. But I'm testing against a nearly empty database, which
>>>>> might make a difference.
>>>>>
>>>>> Is cts:query still a hotspot if you drop the registered-query code
>>>>> entirely?
>>>>> 
>>>>> Is there a particular cts:query term type that triggers this?
>>>>> 
>>>>> Does xdmp:query-meters() show anything indicating database lookups?
>>>>> 
>>>>> -- Mike
>>>>> 
>>>>> On 31 Jul 2013, at 09:37 , Ron Hitchens <[email protected]> wrote:
>>>>> 
>>>>>> 
>>>>>> So here's a little more color on this, if anyone is still 
>>>>>> interested.  When I profile this code, where $query is a fairly 
>>>>>> complex serialized query that was previously computed and stored in 
>>>>>> a database:
>>>>>> 
>>>>>> declare variable $q1 := cts:registered-query (cts:register
>>>>>>   (cts:query ($query)), "unfiltered");
>>>>>> 
>>>>>> cts:search (fn:doc(), $q1)[1 to 5]
>>>>>> 
>>>>>> The top two items on the profile output are:
>>>>>> 
>>>>>> Shallow%  Shallow usecs   Deep%  Deep usecs  Expression
>>>>>> 80        125000          90     140000      cts:query($query)
>>>>>> 10         16000         100     156000      cts:registered-query (cts:register (cts:query ($query)), "unfiltered")
>>>>>> 
>>>>>> Time spent on the actual search is so small it rounded to zero.
>>>>>> 
>>>>>> Doing this repeatedly yields similar timing, so it's not a cold 
>>>>>> cache situation or anything like that.
>>>>>> 
>>>>>> Profiling this:
>>>>>> 
>>>>>> declare variable $q2 := cts:registered-query (9156609332438599120,
>>>>>>   "unfiltered");
>>>>>> 
>>>>>> cts:search (fn:doc(), $q2)[1 to 5]
>>>>>> 
>>>>>> Yields times too fast to measure (all rounded to zero)
>>>>>> 
>>>>>> So, the potentially expensive to create query is being built every 
>>>>>> time and possibly being re-registered as well, given that 
>>>>>> cts:registered-query is taking a non-trivial amount of time.
>>>>>> 
>>>>>> On Jul 31, 2013, at 8:38 AM, Ron Hitchens <[email protected]> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> The overall entitlement query on each request is composed of many 
>>>>>>> sub-queries, some of which are static and registered, some of which 
>>>>>>> are dependent on the current time.  But even the static ones are
>>>>>>> not finite; new ones can be created at any time as part of a new
>>>>>>> entitlement definition.
>>>>>>> 
>>>>>>> I'm working on a scheme to catch and re-register all the static 
>>>>>>> queries in a given query tree when a search fails due to a missing 
>>>>>>> registration.  That should lazily re-register on first use after a 
>>>>>>> server restart as well.
>>>>>>> 
>>>>>>> ---
>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>> 
>>>>>>> 
>>>>>>> On Jul 30, 2013, at 8:30 PM, Geert Josten <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hi Ron,
>>>>>>>> 
>>>>>>>> Are your queries such that you would have a finite number of
>>>>>>>> sub-queries if you broke them into smaller subparts? Perhaps you
>>>>>>>> can combine multiple registered queries.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Geert
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [email protected]
>>>>>>>>> [mailto:general-[email protected]] On Behalf Of Ron Hitchens
>>>>>>>>> Sent: Tuesday, 30 July 2013 2:29
>>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Geert,
>>>>>>>>> 
>>>>>>>>> I've done something before where we stored reg ids in a map for
>>>>>>>>> easy re-use.  In that case, there was a 1:1 correspondence
>>>>>>>>> between the reg id and a meaningful business domain number.  On
>>>>>>>>> this project that's not the case.
>>>>>>>>>
>>>>>>>>> Also, there is not a finite set of queries that need to be
>>>>>>>>> registered so it's not feasible to pre-register everything once.
>>>>>>>>> New ones can be created dynamically.  And the complicated queries
>>>>>>>>> are persisted in another database and can be referenced later.
>>>>>>>>> This means the queries which should be registered will persist
>>>>>>>>> across server restarts.  Which means there must be a way to
>>>>>>>>> register the queries on first use, then make use of those
>>>>>>>>> registered queries on subsequent requests.
>>>>>>>>>
>>>>>>>>> The re-register-before-each-use pattern solves that nicely, but
>>>>>>>>> not if the query construction cost must be re-paid each time.  It
>>>>>>>>> looks like the robust solution is going to have to be catching
>>>>>>>>> exceptions for unregistered queries and reconstructing the
>>>>>>>>> registrations.  It's a shame because that is going to add
>>>>>>>>> unnecessary complexity to the code.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 29, 2013, at 8:15 PM, Geert Josten <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Ron,
>>>>>>>>>> 
>>>>>>>>>> I recently saw a strategy where they deliberately took a
>>>>>>>>>> different approach. In their case the calculation of the queries
>>>>>>>>>> was not straightforward and could run into 30k search terms.
>>>>>>>>>> Additionally, registering the query and warming up the cache by
>>>>>>>>>> doing one initial search after registering each query took the
>>>>>>>>>> most time. They were searching roughly 40 million docs. The
>>>>>>>>>> searches themselves were sub-second.
>>>>>>>>>>
>>>>>>>>>> Their approach was to store all registered query IDs somewhere
>>>>>>>>>> and have them readily available at actual search time. They also
>>>>>>>>>> used a try/catch to catch unregistered queries, though in their
>>>>>>>>>> case they shouldn't actually occur, and these dramatically pulled
>>>>>>>>>> down the average on performance tests.
>>>>>>>>>>
>>>>>>>>>> How much chance is there that a query is unregistered, if you
>>>>>>>>>> would prepare all queries beforehand?
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Geert
>>>>>>>>>> 
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: [email protected]
>>>>>>>>>>> [mailto:general-[email protected]] On Behalf Of Michael Blakeley
>>>>>>>>>>> Sent: Monday, 29 July 2013 21:08
>>>>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>>>>> 
>>>>>>>>>>> I think you're using registered query as intended. That
>>>>>>>>>>> behavior sounds odd to me. I would expect (2) to be cheap, just
>>>>>>>>>>> a hash operation on the query terms, and I would expect (3) to
>>>>>>>>>>> be the expensive step.
>>>>>>>>>>> 
>>>>>>>>>>> So I would contact support and see what they think.
>>>>>>>>>>> 
>>>>>>>>>>> -- Mike
>>>>>>>>>>> 
>>>>>>>>>>> On 29 Jul 2013, at 11:03 , Ron Hitchens <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> What is the best practice these days for using registered 
>>>>>>>>>>>> queries?  I was under the impression that the pattern should be:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1) Create your query:
>>>>>>>>>>>>    $query := cts:and-query ((blah blah blah))
>>>>>>>>>>>> 2) Register it and make a registered query from it in one step:
>>>>>>>>>>>>    $reg-query := cts:registered-query (cts:register ($query),
>>>>>>>>>>>>      "unfiltered")
>>>>>>>>>>>> 3) Use it in a search:
>>>>>>>>>>>>    cts:search (fn:doc(), $reg-query)
>>>>>>>>>>>> 
>>>>>>>>>>>> The theory being that if the cts:query described by $query is
>>>>>>>>>>>> already registered, then the registration is essentially a
>>>>>>>>>>>> no-op and you'll get back the same ID.  And doing this every
>>>>>>>>>>>> time ensures that if the registered query has been evicted for
>>>>>>>>>>>> some reason then it's re-registered and all is well.
>>>>>>>>>>>> 
>>>>>>>>>>>> It's a nice theory but seems to be based on the assumption
>>>>>>>>>>>> that creating a cts:query object is very cheap.
>>>>>>>>>>>> Unfortunately, I'm finding that this is often not the case,
>>>>>>>>>>>> especially when there are lots of documents in the database.
>>>>>>>>>>>> I have a test case where performing Step 2 above on a
>>>>>>>>>>>> moderately complicated query takes roughly 200ms every time.
>>>>>>>>>>>> Others take even longer and all seem to be proportional to
>>>>>>>>>>>> database size.  But running Step 3 with
>>>>>>>>>>>> cts:registered-query(<regid>) is very, very fast (~0ms).
>>>>>>>>>>>> Re-creating the query for re-registering every time is
>>>>>>>>>>>> destroying the benefit of using a registered query.
>>>>>>>>>>>> 
>>>>>>>>>>>> I can obviously save the registration ID obtained from calling
>>>>>>>>>>>> cts:register and then make a cts:registered-query each time,
>>>>>>>>>>>> but then I'm not protected from the query becoming
>>>>>>>>>>>> unregistered.  And there is no lightweight way to test if an
>>>>>>>>>>>> ID is still registered.  The only way I know to make this
>>>>>>>>>>>> robust is to put a loop and try/catch around the code that
>>>>>>>>>>>> does the search.  But that requires passing along enough
>>>>>>>>>>>> context to re-construct and re-register the queries (there can
>>>>>>>>>>>> be dozens of them in this case).  This is obviously a lot
>>>>>>>>>>>> harder than building the complex query in one module and then
>>>>>>>>>>>> passing it along to the search code somewhere else.
>>>>>>>>>>>> 
>>>>>>>>>>>> What's the generally accepted best usage pattern for
>>>>>>>>>>>> registered queries?  And is it my imagination or has the cost
>>>>>>>>>>>> of running queries been moving from query evaluation into
>>>>>>>>>>>> query construction?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> 
>>>>>>>>>>>> ---
>>>>>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> General mailing list
>>>>>>>>>>>> [email protected] 
>>>>>>>>>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>>>>>>>>>> 

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
