Re: [MarkLogic Dev General] Registered Query Best Practices

Michael Blakeley Wed, 31 Jul 2013 15:27:41 -0700

It would be fun to speculate, but I think it would be more productive to 
contact support with those questions.


-- Mike

On 31 Jul 2013, at 15:22 , Ron Hitchens <[email protected]> wrote:

> 
>   I didn't set up these machines (AWS images) but I think they
> have the recommended number of forests for the number of cores.
> There are around 6 million documents loaded, and they should be
> evenly distributed across the forests.
> 
>   But I want to remain focused on the fact that this idiom:
> 
> cts:registered-query (cts:register ($query))
> 
>   Is actually counter-productive if "cts:register ($query)"
> is not always cheap when $query has already been registered.
> I expect to pay any query creation cost the first time, when
> $query is not already registered.  But not every time.
> 
>   Forest size or layout should be irrelevant here.  For this
> idiom to work it must be cheap for cts:register to check for a
> prior registration of $query and quickly return the existing
> registration id.  It appears that $query is being reified on
> EVERY call to cts:register, not just when it needs to actually
> be registered.
> 
>   Has something broken here?  Are the queries being reified
> now when they weren't before?  Is this a regression or has it
> always worked like this?
> 
> ---
> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>     +44 7879 358 212 (voice)          http://www.ronsoft.com
>     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
> 
> On Jul 31, 2013, at 10:21 PM, Danny Sokolsky <[email protected]> wrote:
> 
>> OK, if it is range queries that are at play here, then it might be 
>> interesting to look at how big your forests are.  It is possible that adding 
>> more forests might increase your parallelism and make each forest's part of 
>> the index resolution smaller.  This is especially true with Range Index 
>> operations where there is a lot of data in each forest, because the range 
>> index files are memory mapped.
>> 
>> How many documents are in each forest, and how many forests do you have?
>> 
>> -Danny
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Ron Hitchens
>> Sent: Wednesday, July 31, 2013 2:13 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>> 
>> 
>>  I actually do have a bunch of queries wrapped in a cts:and-query, not 
>> unlike Mike's example (among others).  In some cases these can be collapsed 
>> down to multiple values in one query, in other cases not.
>> But as I said in my reply to Mike, the real issue is that the cost of 
>> constructing a given query increases with the size of the database/indexes.
>> 
>> ---
>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>    +44 7879 358 212 (voice)          http://www.ronsoft.com
>>    +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>> "No amount of belief establishes any fact." -Unknown
>> 
>> 
>> On Jul 31, 2013, at 8:11 PM, Geert Josten <[email protected]> wrote:
>> 
>>> The size and structure of the query can matter a lot. Michael's 
>>> example shows 1000 query parts, but you'll see a linear growth if you 
>>> let it iterate up to 20000. On the other hand, if you pass in those 
>>> 20k random id's as one large sequence of allowed values into one range 
>>> query, the profile time drops to about 50msec again.. :)
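>>> 
>>> A minimal sketch of that difference, assuming a string range index on a
>>> hypothetical element named "id" (the element name and the random values
>>> are illustrative, not taken from Ron's data).  A range query given a
>>> sequence of values matches any one of them, so the single-query form
>>> corresponds to an or-query over per-value queries, not to the and-query
>>> in the profiling example:
>>> 
>>> ```xquery
>>> xquery version "1.0-ml";
>>> 
>>> (: Hypothetical input: 20,000 random hex strings standing in for ids. :)
>>> let $ids := (1 to 20000) ! xdmp:integer-to-hex(xdmp:random())
>>> 
>>> (: Many parts: one range query per value, wrapped in an or-query. :)
>>> let $many-parts := cts:or-query(
>>>   $ids ! cts:element-range-query(xs:QName("id"), "=", .))
>>> 
>>> (: One part: the whole sequence passed as the allowed values. :)
>>> let $one-part := cts:element-range-query(xs:QName("id"), "=", $ids)
>>> 
>>> (: Assumes the "id" range index exists in the target database. :)
>>> return cts:search(fn:doc(), $one-part, "unfiltered")[1 to 5]
>>> ```
>>> 
>>> Profiling the construction of $many-parts against $one-part separately
>>> would show whether the number of query parts is what drives the cost.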
>>> 
>>> @Ron, can you make the query smarter? Doing the same with fewer parts?
>>> 
>>> Kind regards,
>>> Geert
>>> 
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:general- 
>>>> [email protected]] On Behalf Of Michael Blakeley
>>>> Sent: Wednesday, 31 July 2013 20:20
>>>> To: MarkLogic Developer Discussion
>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>> 
>>>> If that profile is correct, I'd be much more worried about the 
>>>> cts:query constructor. I have a hard time getting that much elapsed 
>>>> time out of a cts:query constructor.
>>>> 
>>>> declare variable $q := cts:and-query(
>>>> (1 to 1000) ! cts:element-range-query(
>>>>  xs:QName('does-not-exist'), '=',
>>>>  xdmp:integer-to-hex(xdmp:random()))) ;
>>>> 
>>>> prof:eval('
>>>> declare variable $qnode as element() external ;  cts:query($qnode)',  
>>>> (xs:QName('qnode'),
>>>> document { $q }/*))
>>>> 
>>>> On my laptop, the profile shows cts:query at 99% of shallow, with 
>>>> anywhere from 5-15 ms total. And that's with 1000 terms, which seems
>>>> like a lot. But I'm testing against a nearly empty database, which might 
>>>> make a difference.
>>>> 
>>>> Is cts:query still a hotspot if you drop the registered-query code
>>>> entirely?
>>>> 
>>>> Is there a particular cts:query term type that triggers this?
>>>> 
>>>> Does xdmp:query-meters() show anything indicating database lookups?
>>>> 
>>>> -- Mike
>>>> 
>>>> On 31 Jul 2013, at 09:37 , Ron Hitchens <[email protected]> wrote:
>>>> 
>>>>> 
>>>>> So here's a little more color on this, if anyone is still 
>>>>> interested.  When I profile this code, where $query is a fairly 
>>>>> complex serialized query that was previously computed and stored in 
>>>>> a database:
>>>>> 
>>>>> declare variable $q1 := cts:registered-query (cts:register 
>>>>> (cts:query ($query)), "unfiltered");
>>>>> 
>>>>> cts:search (fn:doc(), $q1)[1 to 5]
>>>>> 
>>>>> The top two items on the profile output are:
>>>>> 
>>>>> Shallow%  Shallow usecs   Deep%  Deep usecs  Expression
>>>>> 80        125000          90     140000      cts:query($query)
>>>>> 10         16000         100     156000      cts:registered-query (cts:register (cts:query ($query)), "unfiltered")
>>>>> 
>>>>> Time spent on the actual search is so small it rounded to zero.
>>>>> 
>>>>> Doing this repeatedly yields similar timing, so it's not a cold 
>>>>> cache situation or anything like that.
>>>>> 
>>>>> Profiling this:
>>>>> 
>>>>> declare variable $q2 := cts:registered-query (9156609332438599120, "unfiltered");
>>>>> 
>>>>> cts:search (fn:doc(), $q2)[1 to 5]
>>>>> 
>>>>> Yields times too fast to measure (all rounded to zero)
>>>>> 
>>>>> So, the potentially expensive to create query is being built every 
>>>>> time and possibly being re-registered as well, given that 
>>>>> cts:registered-query is taking a non-trivial amount of time.
>>>>> 
>>>>> On Jul 31, 2013, at 8:38 AM, Ron Hitchens <[email protected]> wrote:
>>>>> 
>>>>>> 
>>>>>> The overall entitlement query on each request is composed of many 
>>>>>> sub-queries, some of which are static and registered, some of which 
>>>>>> are dependent on the current time.  But even the static ones are 
>>>>>> not finite, new ones can be created at any time as part of a new 
>>>>>> entitlement definition.
>>>>>> 
>>>>>> I'm working on a scheme to catch and re-register all the static 
>>>>>> queries in a given query tree when a search fails due to a missing 
>>>>>> registration.  That should lazily re-register on first use after a 
>>>>>> server restart as well.
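>>>>>> 
>>>>>> A minimal sketch of such a catch-and-re-register scheme for a single
>>>>>> query, not a definitive implementation.  The function name, its
>>>>>> parameters and the placeholder query are hypothetical; it relies on
>>>>>> a search with an unknown registration id raising XDMP-UNREGISTERED:
>>>>>> 
>>>>>> ```xquery
>>>>>> xquery version "1.0-ml";
>>>>>> 
>>>>>> (: Hypothetical helper: search with a stored registration id and
>>>>>>    rebuild the query only if the id is no longer registered. :)
>>>>>> declare function local:search-registered(
>>>>>>   $reg-id as xs:unsignedLong,
>>>>>>   $rebuild as function(*)   (: returns the original cts:query :)
>>>>>> ) as document-node()*
>>>>>> {
>>>>>>   try {
>>>>>>     (: xdmp:eager forces evaluation inside the try block :)
>>>>>>     xdmp:eager(
>>>>>>       cts:search(fn:doc(),
>>>>>>         cts:registered-query($reg-id, "unfiltered"))[1 to 5])
>>>>>>   } catch ($e) {
>>>>>>     if ($e/error:code = "XDMP-UNREGISTERED") then
>>>>>>       (: the construction cost is paid only on this path :)
>>>>>>       let $new-id := cts:register($rebuild())
>>>>>>       return cts:search(fn:doc(),
>>>>>>         cts:registered-query($new-id, "unfiltered"))[1 to 5]
>>>>>>     else
>>>>>>       xdmp:rethrow()
>>>>>>   }
>>>>>> };
>>>>>> 
>>>>>> (: Placeholder query; a real caller would deserialize a stored one. :)
>>>>>> local:search-registered(
>>>>>>   xs:unsignedLong("9156609332438599120"),
>>>>>>   function() { cts:and-query(()) })
>>>>>> ```
>>>>>> 
>>>>>> A fuller version would also hand the new id back to the caller so it
>>>>>> can be stored, and would walk a query tree rather than one query.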
>>>>>> 
>>>>>> ---
>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>> 
>>>>>> 
>>>>>> On Jul 30, 2013, at 8:30 PM, Geert Josten <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Ron,
>>>>>>> 
>>>>>>> Are your queries such that you would have a finite number of
>>>>>>> sub-queries, if you would break them into smaller subparts? Perhaps
>>>>>>> you can combine multiple registered queries..
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Geert
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: [email protected] [mailto:general- 
>>>>>>>> [email protected]] On Behalf Of Ron Hitchens
>>>>>>>> Sent: Tuesday, 30 July 2013 2:29
>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Geert,
>>>>>>>> 
>>>>>>>> I've done something before where we stored reg ids in a map for
>>>>>>>> easy re-use.  In that case, there was a 1:1 correspondence
>>>>>>>> between the reg id and a meaningful business domain number.  On
>>>>>>>> this project that's not the case.
>>>>>>>> 
>>>>>>>> Also, there is not a finite set of queries that need to be
>>>>>>>> registered so it's not feasible to pre-register everything once.
>>>>>>>> New ones can be created dynamically.  And the complicated queries
>>>>>>>> are persisted in another database and can be referenced later.
>>>>>>>> This means the queries which should be registered will persist
>>>>>>>> across server restarts.  Which means there must be a way to
>>>>>>>> register the queries on first use, then make use of those
>>>>>>>> registered queries on subsequent requests.
>>>>>>>> 
>>>>>>>> The re-register-before-each-use pattern solves that nicely, but
>>>>>>>> not if the query construction cost must be re-paid each time.  It
>>>>>>>> looks like the robust solution is going to have to be catching
>>>>>>>> exceptions for unregistered queries and reconstructing the
>>>>>>>> registrations.  It's a shame because that is going to add
>>>>>>>> unnecessary complexity to the code.
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 29, 2013, at 8:15 PM, Geert Josten <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi Ron,
>>>>>>>>> 
>>>>>>>>> I recently saw a strategy where they deliberately took a
>>>>>>>>> different approach. In their case the calculation of the queries
>>>>>>>>> was not straightforward and could run into 30k search terms.
>>>>>>>>> Additionally, registering the query, and warming up the cache by
>>>>>>>>> doing one initial search after registering each query, took the
>>>>>>>>> most time. They were searching roughly 40 million docs. The
>>>>>>>>> searches themselves were sub-second.
>>>>>>>>> 
>>>>>>>>> Their approach was to store all registered query ids somewhere,
>>>>>>>>> and have them readily available at actual search time. They also
>>>>>>>>> used a try/catch to catch unregistered queries, though in their
>>>>>>>>> case they shouldn't actually occur, and these dramatically pulled
>>>>>>>>> down the average on performance tests.
>>>>>>>>> 
>>>>>>>>> How much chance is there that a query is unregistered, if you
>>>>>>>>> would prepare all queries beforehand?
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Geert
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [email protected] [mailto:general- 
>>>>>>>>>> [email protected]] On Behalf Of Michael Blakeley
>>>>>>>>>> Sent: Monday, 29 July 2013 21:08
>>>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>>>> 
>>>>>>>>>> I think you're using registered query as intended. That behavior
>>>>>>>>>> sounds odd to me. I would expect (2) to be cheap, just a hash
>>>>>>>>>> operation on the query terms, and I would expect (3) to be the
>>>>>>>>>> expensive step.
>>>>>>>>>> 
>>>>>>>>>> So I would contact support and see what they think.
>>>>>>>>>> 
>>>>>>>>>> -- Mike
>>>>>>>>>> 
>>>>>>>>>> On 29 Jul 2013, at 11:03 , Ron Hitchens <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> What is the best practice these days for using registered 
>>>>>>>>>>> queries?  I was under the impression that the pattern should be:
>>>>>>>>>>> 
>>>>>>>>>>> 1) Create your query:
>>>>>>>>>>> $query := cts:and-query ((blah blah blah))
>>>>>>>>>>> 2) Register it and make a registered query from it in one step:
>>>>>>>>>>> $reg-query := cts:registered-query (cts:register ($query), "unfiltered")
>>>>>>>>>>> 3) Use it in a search:
>>>>>>>>>>> cts:search (fn:doc(), $reg-query)
>>>>>>>>>>> 
>>>>>>>>>>> The theory being that if the cts:query described by $query is
>>>>>>>>>>> already registered, then the registration is essentially a
>>>>>>>>>>> no-op and you'll get back the same ID.  And doing this every
>>>>>>>>>>> time ensures that if the registered query has been evicted for
>>>>>>>>>>> some reason then it's re-registered and all is well.
>>>>>>>>>>> 
>>>>>>>>>>> It's a nice theory but seems to be based on the assumption
>>>>>>>>>>> that creating a cts:query object is very cheap.
>>>>>>>>>>> Unfortunately, I'm finding that this is often not the case,
>>>>>>>>>>> especially when there are lots of documents in the database.
>>>>>>>>>>> I have a test case where performing Step 2 above on a
>>>>>>>>>>> moderately complicated query takes roughly 200ms every time.
>>>>>>>>>>> Others take even longer and all seem to be proportional to
>>>>>>>>>>> database size.  But running Step 3 with
>>>>>>>>>>> cts:registered-query(<regid>) is very, very fast (~0ms).
>>>>>>>>>>> Re-creating the query for re-registering every time is
>>>>>>>>>>> destroying the benefit of using a registered query.
>>>>>>>>>>> 
>>>>>>>>>>> I can obviously save the registration ID obtained from calling
>>>>>>>>>>> cts:register and then make a cts:registered-query each time,
>>>>>>>>>>> but then I'm not protected from the query becoming
>>>>>>>>>>> unregistered.  And there is no lightweight way to test if an
>>>>>>>>>>> ID is still registered.  The only way I know to make this
>>>>>>>>>>> robust is to put a loop and try/catch around the code that
>>>>>>>>>>> does the search.  But that requires passing along enough
>>>>>>>>>>> context to re-construct and re-register the queries (there can
>>>>>>>>>>> be dozens of them in this case).  This is obviously a lot
>>>>>>>>>>> harder than building the complex query in one module and then
>>>>>>>>>>> passing it along to the search code somewhere else.
>>>>>>>>>>> 
>>>>>>>>>>> What's the generally accepted best usage pattern for
>>>>>>>>>>> registered queries?  And is it my imagination or has the cost
>>>>>>>>>>> of running queries been moving from query evaluation into
>>>>>>>>>>> query construction?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks.
>>>>>>>>>>> 
>>>>>>>>>>> ---
>>>>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>>>>> +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>>>>> +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>>>>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> General mailing list
>>>>>>>>>>> [email protected] 
>>>>>>>>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>>>>>>>>> 
>>>>>>>>>> 
