But doesn't it still take time to generate the second term list (the expensive one)? And if the second (expensive) term list will only be a subset of the first (inexpensive) one, wouldn't there be a performance gain from building the second list out of the first, rather than intersecting two lists that were each created independently from the entire index? Something somewhere has to be checking values against each other, and it seems that the fewer checks that need to happen, the faster things will go. But maybe not.

-Ryan
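
A rough illustration of the intersection being asked about may help here. The sketch below is not MarkLogic's implementation; it only assumes that each term list is a sorted list of document ids, and the function name and sample ids are hypothetical. It walks the shorter (more selective) list and binary-searches the longer one, so the number of comparisons is driven by the selective list, whichever order the two lists are passed in.

```python
from bisect import bisect_left


def intersect_term_lists(list_a, list_b):
    """Intersect two sorted lists of document ids (hypothetical helper).

    The shorter list drives the loop and the longer list is only probed
    by binary search, so a highly selective list keeps the work small.
    """
    # Always iterate over the shorter list, regardless of argument order.
    if len(list_a) > len(list_b):
        list_a, list_b = list_b, list_a

    matches = []
    lo = 0  # Never search before the previous hit; both lists are sorted.
    for doc_id in list_a:
        lo = bisect_left(list_b, doc_id, lo)
        if lo == len(list_b):
            break  # The longer list is exhausted; nothing else can match.
        if list_b[lo] == doc_id:
            matches.append(doc_id)
    return matches


# Hypothetical term lists: a selective one and a long, unselective one.
num_words_is_5 = [3, 17, 42, 99]
contains_bandit = list(range(0, 1000, 3))

print(intersect_term_lists(num_words_is_5, contains_bandit))
print(intersect_term_lists(contains_bandit, num_words_is_5))
```

Both calls print the same ids, which is the point: with set arithmetic over sorted lists, the order in which the constraints are written does not need to act as a hint.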
From: jhun...@marklogic.com
Date: Fri, 20 Apr 2012 22:37:03 -0700
To: general@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which condition is faster to check first?

Bottom line: For your word count use-case you should get the performance you want if you can write it as a cts:search of a cts:and-query.

But it doesn't quite work the way you're thinking. It's not ordered execution like it would be in Java. It doesn't short-circuit.

To run a cts:search there's no looking inside documents (at least not until the final filtering phase to verify results). It's all just term list set arithmetic. In your example, the num-words constraint will quickly determine the documents with the right number of words, and that list of document ids taken from the index will be intersected with the document ids matching the other constraint(s) based on indexes. If the other constraints aren't selective (a better word to use here than expensive) then it's OK because the first constraint is highly selective. Any documents with phrases that match the second constraint but not the first (wrong number of words) won't be included because they don't intersect.

Intersecting against long term lists is efficient, so there's no need for MarkLogic to short-circuit. Your subqueries are resolved to the extent possible by the indexes.

-jh-

On Apr 20, 2012, at 8:54 PM, seme...@hotmail.com wrote:

So could I do:

cts:search(/, cts:and-query(( cts:inexpensive-query... , cts:expensive-query... )))

and MarkLogic will check the first condition (cts:inexpensive-query) first and only check the second condition if the first is true?

CC: general@developer.marklogic.com
From: m...@blakeley.com
Date: Fri, 20 Apr 2012 19:49:17 -0700
To: general@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which condition is faster to check first?

Yes, boolean ops will short-circuit.
You can test this for yourself using xdmp:sleep and xdmp:elapsed-time.

-- Mike

On Apr 20, 2012, at 19:15, "seme...@hotmail.com" <seme...@hotmail.com> wrote:

I may have some queries where the comparison is expensive. So what I'd like to do is add an extra element to each doc as a "shortcut" to check first, before doing the expensive comparison.

For example, suppose I had data with random words ("boat", "alligator", "house") in an element called "words", and a random number of words in the element (say, 5 to 50). Other documents may have the same words but in a different order. I want to find the documents that have the same number of words and the same words, regardless of the order. So

1: <words>boat alligator house bandit flower</words>

and

2: <words>bandit alligator bandit flower boat</words>

would be a match, but

3: <words>bandit alligator bandit flower boat island</words>

would not be a match because it has an extra word.

I thought that I could add a new element to each doc which represents the number of words (5, 6, 11, 20, etc.), and first check that the doc has the right number of words before I check whether it has the same words. I am thinking the extra check on the number of words would shortcut the query so that it doesn't even bother checking individual words if the number of words doesn't match, and save me some time. The time may add up if I have millions or tens of millions of docs to query against.

So if my thinking is correct, then I would have documents that look like this:

<doc> <words>bandit alligator bandit flower boat</words> <num-words>5</num-words></doc>

I could put a range index on the "num-words" element of type xs:int. Then I'd like to write queries so that the num-words condition is checked first by the magical MarkLogic engine, and only if that first condition is met would it check the rest.

I know that in Java the runtime environment won't check the second condition if the first is false in a boolean statement.
So:

if (1 == 0 && explode()) { ....

"explode()" will never be called because the first condition in the statement is false. But the order is important; "1 == 0" must come before "explode()" because the statement is evaluated from left to right.

I don't know if XQuery or MarkLogic work that way (I didn't see anything in the spec), and I know that MarkLogic has all sorts of optimizations, but how will it know that it's faster to check the "num-words" condition before the individual words? Can I write a cts:query that gives a hint to MarkLogic to give precedence to one condition over another to save time? *I* know that the num-words check is faster, but how can *MarkLogic* know that?

I suppose it could be argued that it doesn't really matter because MarkLogic runs fast anyway, but I'm talking about long-running queries over massive data sets, so even small amounts of time are important to me.

Thanks!

-Ryan

_______________________________________________
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general