It would be fun to speculate, but I think it would be more productive to contact support with those questions.
-- Mike

On 31 Jul 2013, at 15:22, Ron Hitchens <[email protected]> wrote:

>
> I didn't set up these machines (AWS images) but I think they
> have the recommended number of forests for the number of cores.
> There are around 6 million documents loaded and they should be
> evenly distributed across the forests.
>
> But I want to remain focused on the fact that this idiom:
>
>    cts:registered-query (cts:register ($query))
>
> is actually counter-productive if "cts:register ($query)"
> is not always cheap when $query has already been registered.
> I expect to pay any query creation cost the first time, when
> $query is not already registered. But not every time.
>
> Forest size or layout should be irrelevant here. For this
> idiom to work it must be cheap for cts:register to check for a
> prior registration of $query and quickly return the existing
> registration id. It appears that $query is being reified on
> EVERY call to cts:register, not just when it needs to actually
> be registered.
>
> Has something broken here? Are the queries being reified
> now when they weren't before? Is this a regression or has it
> always worked like this?
>
> ---
> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>      +44 7879 358 212 (voice)          http://www.ronsoft.com
>      +1 707 924 3878 (fax)             Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
>
> On Jul 31, 2013, at 10:21 PM, Danny Sokolsky <[email protected]>
> wrote:
>
>> OK, if it is range queries that are at play here, then it might be
>> interesting to look at how big your forests are. It is possible that adding
>> more forests might increase your parallelism and make each forest's part of
>> the index resolution smaller. This is especially true with Range Index
>> operations where there is a lot of data in each forest, because the range
>> index files are memory mapped.
>>
>> How many documents are in each forest, and how many forests do you have?
>>
>> -Danny
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Ron Hitchens
>> Sent: Wednesday, July 31, 2013 2:13 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>
>>
>> I actually do have a bunch of queries wrapped in a cts:and-query, not
>> unlike Mike's example (among others). In some cases these can be collapsed
>> down to multiple values in one query, in other cases not.
>> But as I said in my reply to Mike, the real issue is that the cost of
>> constructing a given query increases with the size of the database/indexes.
>>
>> ---
>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>      +44 7879 358 212 (voice)          http://www.ronsoft.com
>>      +1 707 924 3878 (fax)             Bit Twiddling At Its Finest
>> "No amount of belief establishes any fact." -Unknown
>>
>>
>> On Jul 31, 2013, at 8:11 PM, Geert Josten <[email protected]> wrote:
>>
>>> The size and structure of the query can matter a lot. Michael's
>>> example shows 1000 query parts, but you'll see a linear growth if you
>>> let it iterate up to 20000. On the other hand, if you pass in those
>>> 20k random ids as one large sequence of allowed values into one range
>>> query, the profile time drops to about 50 msec again.. :)
>>>
>>> @Ron, can you make the query smarter? Doing the same with fewer parts?
>>>
>>> Kind regards,
>>> Geert
>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:general-[email protected]] On Behalf Of Michael Blakeley
>>>> Sent: Wednesday, 31 July 2013 20:20
>>>> To: MarkLogic Developer Discussion
>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>
>>>> If that profile is correct, I'd be much more worried about the
>>>> cts:query constructor. I have a hard time getting that much elapsed
>>>> time out of a cts:query constructor.
>>>>
>>>> declare variable $q := cts:and-query(
>>>>   (1 to 1000) !
>>>>   cts:element-range-query(
>>>>     xs:QName('does-not-exist'), '=',
>>>>     xdmp:integer-to-hex(xdmp:random()))) ;
>>>>
>>>> prof:eval('
>>>> declare variable $qnode as element() external ; cts:query($qnode)',
>>>>   (xs:QName('qnode'),
>>>>    document { $q }/*))
>>>>
>>>> On my laptop, the profile shows cts:query at 99% of shallow, with
>>>> anywhere from 5-15 ms total. And that's with 1000 terms, which seems
>>>> like a lot. But I'm testing against a nearly empty database, which
>>>> might make a difference.
>>>>
>>>> Is cts:query still a hotspot if you drop the registered-query code
>>>> entirely?
>>>>
>>>> Is there a particular cts:query term type that triggers this?
>>>>
>>>> Does xdmp:query-meters() show anything indicating database lookups?
>>>>
>>>> -- Mike
>>>>
>>>> On 31 Jul 2013, at 09:37, Ron Hitchens <[email protected]> wrote:
>>>>
>>>>>
>>>>> So here's a little more color on this, if anyone is still
>>>>> interested. When I profile this code, where $query is a fairly
>>>>> complex serialized query that was previously computed and stored in
>>>>> a database:
>>>>>
>>>>> declare variable $q1 := cts:registered-query (cts:register
>>>>>   (cts:query ($query)), "unfiltered");
>>>>>
>>>>> cts:search (fn:doc(), $q1)[1 to 5]
>>>>>
>>>>> The top two items on the profile output are:
>>>>>
>>>>> Shallow%  Shallow usecs  Deep%  Deep usecs  Expression
>>>>>       80         125000     90      140000  cts:query($query)
>>>>>       10          16000    100      156000  cts:registered-query (cts:register (cts:query ($query)), "unfiltered")
>>>>>
>>>>> Time spent on the actual search is so small it rounded to zero.
>>>>>
>>>>> Doing this repeatedly yields similar timing, so it's not a cold
>>>>> cache situation or anything like that.
>>>>>
>>>>> Profiling this:
>>>>>
>>>>> declare variable $q2 := cts:registered-query (9156609332438599120,
>>>>>   "unfiltered");
>>>>>
>>>>> cts:search (fn:doc(), $q2)[1 to 5]
>>>>>
>>>>> Yields times too fast to measure (all rounded to zero).
>>>>>
>>>>> So, the potentially expensive to create query is being built every
>>>>> time and possibly being re-registered as well, given that
>>>>> cts:registered-query is taking a non-trivial amount of time.
>>>>>
>>>>> On Jul 31, 2013, at 8:38 AM, Ron Hitchens <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> The overall entitlement query on each request is composed of many
>>>>>> sub-queries, some of which are static and registered, some of which
>>>>>> are dependent on the current time. But even the static ones are
>>>>>> not finite, new ones can be created at any time as part of a new
>>>>>> entitlement definition.
>>>>>>
>>>>>> I'm working on a scheme to catch and re-register all the static
>>>>>> queries in a given query tree when a search fails due to a missing
>>>>>> registration. That should lazily re-register on first use after a
>>>>>> server restart as well.
>>>>>>
>>>>>> ---
>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>      +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>      +1 707 924 3878 (fax)             Bit Twiddling At Its Finest
>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>
>>>>>>
>>>>>> On Jul 30, 2013, at 8:30 PM, Geert Josten <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ron,
>>>>>>>
>>>>>>> Are your queries such that you would have a finite number of
>>>>>>> sub-queries, if you would break them into smaller subparts? Perhaps
>>>>>>> you can combine multiple registered queries..
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Geert
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: [email protected]
>>>>>>>> [mailto:general-[email protected]] On Behalf Of Ron Hitchens
>>>>>>>> Sent: Tuesday, 30 July 2013 2:29
>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best Practices
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Geert,
>>>>>>>>
>>>>>>>> I've done something before where we stored reg ids in a map for
>>>>>>>> easy re-use. In that case, there was a 1:1 correspondence
>>>>>>>> between the reg id and a meaningful business domain number. On
>>>>>>>> this project that's not the case.
>>>>>>>>
>>>>>>>> Also, there is not a finite set of queries that need to be
>>>>>>>> registered so it's not feasible to pre-register everything once.
>>>>>>>> New ones can be created dynamically. And the complicated queries
>>>>>>>> are persisted in another database and can be referenced later.
>>>>>>>> This means the queries which should be registered will persist
>>>>>>>> across server restarts. Which means there must be a way to
>>>>>>>> register the queries on first use, then make use of those
>>>>>>>> registered queries on subsequent requests.
>>>>>>>>
>>>>>>>> The re-register-before-each-use pattern solves that nicely, but
>>>>>>>> not if the query construction cost must be re-paid each time. It
>>>>>>>> looks like the robust solution is going to have to be catching
>>>>>>>> exceptions for unregistered queries and reconstructing the
>>>>>>>> registrations. It's a shame because that is going to add
>>>>>>>> unnecessary complexity to the code.
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>>      +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>>      +1 707 924 3878 (fax)             Bit Twiddling At Its Finest
>>>>>>>> "No amount of belief establishes any fact." -Unknown
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 29, 2013, at 8:15 PM, Geert Josten <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Ron,
>>>>>>>>>
>>>>>>>>> I recently saw a strategy where they deliberately took a
>>>>>>>>> different approach. In their case the calculation of the queries
>>>>>>>>> was not straightforward and could run into 30k search terms.
>>>>>>>>> Additionally, registering the query, and warming up the cache by
>>>>>>>>> doing one initial search after registering each query, took the
>>>>>>>>> most time. They were searching roughly 40 million docs. The
>>>>>>>>> searches themselves were sub-second..
>>>>>>>>>
>>>>>>>>> Their approach was to store all registered query ids somewhere,
>>>>>>>>> and have them readily available at actual search time. They also
>>>>>>>>> used a try/catch to catch unregistered queries, though in their
>>>>>>>>> case they shouldn't actually occur, and these dramatically pulled
>>>>>>>>> down the average on performance tests.
>>>>>>>>>
>>>>>>>>> How much chance is there that a query is unregistered, if you
>>>>>>>>> would prepare all queries beforehand?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Geert
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [email protected]
>>>>>>>>>> [mailto:general-[email protected]] On Behalf Of Michael Blakeley
>>>>>>>>>> Sent: Monday, 29 July 2013 21:08
>>>>>>>>>> To: MarkLogic Developer Discussion
>>>>>>>>>> Subject: Re: [MarkLogic Dev General] Registered Query Best
>>>>>>>>>> Practices
>>>>>>>>>>
>>>>>>>>>> I think you're using registered query as intended. That
>>>>>>>>>> behavior sounds odd to me.
>>>>>>>>>> I would expect (2) to be cheap, just a hash operation on the
>>>>>>>>>> query terms, and I would expect (3) to be the expensive step.
>>>>>>>>>>
>>>>>>>>>> So I would contact support and see what they think.
>>>>>>>>>>
>>>>>>>>>> -- Mike
>>>>>>>>>>
>>>>>>>>>> On 29 Jul 2013, at 11:03, Ron Hitchens <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What is the best practice these days for using registered
>>>>>>>>>>> queries? I was under the impression that the pattern should be:
>>>>>>>>>>>
>>>>>>>>>>> 1) Create your query:
>>>>>>>>>>>      $query := cts:and-query ((blah blah blah))
>>>>>>>>>>> 2) Register it and make a registered query from it in one step:
>>>>>>>>>>>      $reg-query := cts:registered-query (cts:register ($query),
>>>>>>>>>>>        "unfiltered")
>>>>>>>>>>> 3) Use it in a search:
>>>>>>>>>>>      cts:search (fn:doc(), $reg-query)
>>>>>>>>>>>
>>>>>>>>>>> The theory being that if the cts:query described by $query is
>>>>>>>>>>> already registered, then the registration is essentially a
>>>>>>>>>>> no-op and you'll get back the same ID. And doing this every
>>>>>>>>>>> time ensures that if the registered query has been evicted for
>>>>>>>>>>> some reason then it's re-registered and all is well.
>>>>>>>>>>>
>>>>>>>>>>> It's a nice theory but seems to be based on the assumption
>>>>>>>>>>> that creating a cts:query object is very cheap.
>>>>>>>>>>> Unfortunately, I'm finding that this is often not the case,
>>>>>>>>>>> especially when there are lots of documents in the database.
>>>>>>>>>>> I have a test case where performing Step 2 above on a
>>>>>>>>>>> moderately complicated query takes roughly 200ms every time.
>>>>>>>>>>> Others take even longer and all seem to be proportional to
>>>>>>>>>>> database size. But running Step 3 with
>>>>>>>>>>> cts:registered-query(<regid>) is very, very fast (~0ms).
>>>>>>>>>>> Re-creating the query for re-registering every time is
>>>>>>>>>>> destroying the benefit of using a registered query.
>>>>>>>>>>>
>>>>>>>>>>> I can obviously save the registration ID obtained from calling
>>>>>>>>>>> cts:register and then make a cts:registered-query each time,
>>>>>>>>>>> but then I'm not protected from the query becoming
>>>>>>>>>>> unregistered. And there is no lightweight way to test if an ID
>>>>>>>>>>> is still registered. The only way I know to make this robust
>>>>>>>>>>> is to put a loop and try/catch around the code that does the
>>>>>>>>>>> search. But that requires passing along enough context to
>>>>>>>>>>> re-construct and re-register the queries (there can be dozens
>>>>>>>>>>> of them in this case). This is obviously a lot harder than
>>>>>>>>>>> building the complex query in one module and then passing it
>>>>>>>>>>> along to the search code somewhere else.
>>>>>>>>>>>
>>>>>>>>>>> What's the generally accepted best usage pattern for
>>>>>>>>>>> registered queries? And is it my imagination or has the cost
>>>>>>>>>>> of running queries been moving from query evaluation into
>>>>>>>>>>> query construction?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>>>>>>>>>      +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>>>>>>>>>      +1 707 924 3878 (fax)             Bit Twiddling At Its Finest
>>>>>>>>>>> "No amount of belief establishes any fact."
>>>>>>>>>>> -Unknown
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> General mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> http://developer.marklogic.com/mailman/listinfo/general
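
For reference, the catch-and-re-register approach discussed in this thread might look roughly like the sketch below. It is only an illustration under several assumptions: the function name local:search-registered is hypothetical, the saved registration id and the serialized query are assumed to be supplied by the caller, the error code for a missing registration is assumed to be XDMP-UNREGISTERED, and xdmp:eager is used on the assumption that the search would otherwise be evaluated lazily, outside the try block.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: search with a saved registration id and rebuild
   the registration only when the server no longer knows that id. :)
declare function local:search-registered(
  $reg-id as xs:unsignedLong,
  $serialized-query as element()
) as document-node()*
{
  try {
    (: Fast path: no cts:query construction and no cts:register call.
       xdmp:eager forces evaluation so an error surfaces inside the try. :)
    xdmp:eager(
      cts:search(fn:doc(),
        cts:registered-query($reg-id, "unfiltered"))[1 to 5])
  } catch ($e) {
    (: Assumption: a missing registration is reported with this code. :)
    if ($e/error:code = "XDMP-UNREGISTERED") then
      (: Slow path: pay the construction and registration cost once. :)
      let $new-id := cts:register(cts:query($serialized-query))
      return
        cts:search(fn:doc(),
          cts:registered-query($new-id, "unfiltered"))[1 to 5]
    else
      xdmp:rethrow()
  }
};
```

With something like this, the expensive cts:query($query) call from the profile above would only run after an eviction or a server restart, at the cost of having to pass the serialized query along with the id.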
