Re: [MarkLogic Dev General] How to give hints to MarkLogic on which condition is faster to check first?

Jason Hunter Fri, 20 Apr 2012 22:37:19 -0700

Bottom line: For your word count use-case you should get the performance you 
want if you can write it as a cts:search of a cts:and-query.


But it doesn't quite work the way you're thinking.  A cts:search isn't ordered 
execution like it would be in Java, and it doesn't short-circuit.  Running a 
cts:search involves no looking inside documents (at least not until the final 
filtering phase, which verifies results).  It's all just term list set 
arithmetic.

In your example, the num-words constraint will quickly determine the documents 
with the right number of words, and that list of document ids taken from the 
index will be intersected with the document ids matching the other 
constraint(s) based on indexes.  If the other constraints aren't selective (a 
better word to use here than expensive) then it's OK because the first 
constraint is highly selective.  Any documents with phrases that match the 
second constraint but not the first (wrong # of words) won't be included 
because they don't intersect.

Intersecting against long term lists is efficient so there's no need for 
MarkLogic to short circuit.  Your subqueries are resolved to the extent 
possible by the indexes.
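
As a rough sketch of what that could look like for the word-count case (not 
tested against your data; it assumes documents shaped like the <doc> example 
quoted below, an xs:int element range index on num-words, and the five sample 
words as the search terms):

```xquery
xquery version "1.0-ml";

(: Assumption: an element range index of type xs:int exists on num-words. :)
cts:search(fn:doc(),
  cts:and-query((
    (: resolved from the range index :)
    cts:element-range-query(xs:QName("num-words"), "=", xs:int(5)),
    (: one word query per search term, each resolved from the indexes :)
    for $w in ("boat", "alligator", "house", "bandit", "flower")
    return cts:element-word-query(xs:QName("words"), $w)
  ))
)
```

The order of the sub-queries inside the cts:and-query shouldn't matter, since 
the matching document ids are intersected rather than evaluated left to right.  
Note that this only requires each word to be present, so if a word list can 
contain duplicates the results may still need an exact comparison afterwards.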

-jh-

On Apr 20, 2012, at 8:54 PM, seme...@hotmail.com wrote:

> So could I do:
> 
> cts:search(/,
>     cts:and-query((
>         cts:inexpensive-query...
>         ,
>         cts:expensive-query...
>     ))
> )
> 
> and MarkLogic will check the first condition (cts:inexpensive-query) first 
> and only check the second condition if the first is true?
> 
> 
> CC: general@developer.marklogic.com
> From: m...@blakeley.com
> Date: Fri, 20 Apr 2012 19:49:17 -0700
> To: general@developer.marklogic.com
> Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which  
> condition is faster to check first?
> 
> Yes, boolean ops will short-circuit. You can test this for yourself using 
> xdmp:sleep and xdmp:elapsed-time.
> 
> -- Mike
> 
> On Apr 20, 2012, at 19:15, "seme...@hotmail.com" <seme...@hotmail.com> wrote:
> 
> I may have some queries where the comparison is expensive. So what I'd like 
> to do is add an extra element in each doc which is a "shortcut" to check for 
> first before doing the expensive comparison.
> 
> For example, suppose I had data that had random words ("boat", 
> "alligator", "house") in an element called "words" and there was a random 
> number of words in the element (say, 5 to 50). Other documents may have the 
> same words but in a different order. I want to find the documents that have 
> the same number of words and the same words regardless of the order.
> 
> So 
> 
> 1: <words>boat alligator house bandit flower</words>
> 
> and 
> 2: <words>bandit alligator bandit flower boat</words>
> would be a match
> 
> but 
> 3: <words>bandit alligator bandit flower boat island</words>
> would not be a match because it has an extra word
> 
> I thought that I can add a new element to each doc which represents the 
> number of words (5, 6, 11, 20, etc) and I can first check that the doc has 
> the right number of words before I check to see if it has the same words. I 
> am thinking the extra check on the number of words would shortcut the query 
> to not even bother checking individual words if the number of words doesn't 
> match and save me some time. The time may add up if I have millions or tens 
> of millions of docs to query against.
> 
> So if my thinking is correct, then I would have documents that look like this:
> 
> <doc>
>     <words>bandit alligator bandit flower boat</words>
>     <num-words>5</num-words>
> </doc>
> 
> I could put a range index on the "num-words" element of type xs:int.
> 
> Then I'd like to write queries so that the num-words condition is checked 
> first by the magical MarkLogic engine and only if that first condition is met 
> would it check the rest.
> 
> I know in Java that the runtime environment won't check the second condition 
> if the first is false in a boolean statement. So:
> 
> if (1 == 0 && explode()) { ....
> 
> "explode()" will never be called because the first condition in the statement 
> is false. But the order is important; "1 == 0" must be before "explode()" in 
> the statement because that statement will be evaluated from left to right.
> 
> I don't know if XQuery or MarkLogic works that way (didn't see anything in 
> the spec) and I know that MarkLogic has all sorts of optimizations, but how 
> will it know that it's faster to check the "num-words" condition before the 
> individual words? Can I write a cts:query that gives a hint to MarkLogic to 
> give precedence to one condition over another to save time? *I* know that the 
> num-words check is faster but how can *MarkLogic* know that? 
> 
> I suppose it could be argued that it doesn't really matter because MarkLogic 
> runs fast anyway, but I'm talking about long running queries over massive 
> data sets so even small amounts of time are important to me.
> 
> Thanks!
> 
> -Ryan
> _______________________________________________
> General mailing list
> General@developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general

