Re: [MarkLogic Dev General] How to give hints to MarkLogic on which condition is faster to check first?

seme...@hotmail.com Fri, 20 Apr 2012 22:45:48 -0700

But doesn't it still take time to generate the second (expensive) term list? 
And if the second term list will only be a subset of the first (inexpensive) 
one, wouldn't there be a performance gain from building the second list out of 
the first, rather than intersecting two lists that were each created 
independently from the entire index?
Something somewhere has got to be checking values against each other, and it 
seems like the fewer checks that need to happen, the faster things will go. But 
maybe not.
-Ryan


From: jhun...@marklogic.com
Date: Fri, 20 Apr 2012 22:37:03 -0700
To: general@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which
condition is faster to check first?

Bottom line: For your word count use-case you should get the performance you 
want if you can write it as a cts:search of a cts:and-query.

But it doesn't quite work like you're thinking.  It's not ordered execution 
like it would be in Java.  It doesn't short-circuit.  Running a cts:search 
involves no looking inside documents (at least not until the final filtering 
phase, which verifies results).  It's all just term list set arithmetic.

In your example, the num-words constraint will quickly determine the documents 
with the right number of words, and that list of document ids taken from the 
index will be intersected with the document ids matching the other 
constraint(s), also based on indexes.  If the other constraints aren't 
selective (a better word to use here than expensive) then it's OK, because the 
first constraint is highly selective.  Any documents with phrases that match 
the second constraint but not the first (wrong # of words) won't be included 
because they aren't in the intersection.

Intersecting against long term lists is efficient, so there's no need for 
MarkLogic to short-circuit.  Your subqueries are resolved to the extent 
possible by the indexes.
-jh-
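
The "term list set arithmetic" described above can be pictured with a small Java sketch. This is only an illustration under stated assumptions: the class name, method name, and document ids are hypothetical, the lists are assumed to be sorted in ascending order, and nothing here claims to be how MarkLogic stores or intersects its term lists internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: MarkLogic's real term lists and index
// structures are internal and are not modeled here.
class TermListSketch {

    // Intersects two ascending lists of document ids in a single merge pass.
    // Neither list is "checked first"; both cursors advance together.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);   // id is present in both lists
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;             // a[i] cannot occur later in b
            } else {
                j++;             // b[j] cannot occur later in a
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] numWordsIsFive = {3, 17, 42};                          // selective constraint
        int[] containsWords  = {1, 2, 3, 5, 8, 17, 20, 42, 57, 99};  // less selective
        System.out.println(intersect(numWordsIsFive, containsWords)); // [3, 17, 42]
        System.out.println(intersect(containsWords, numWordsIsFive)); // [3, 17, 42]
    }
}
```

The result is the same whichever list is passed first, which is the point of the explanation: the order in which the constraints are written does not change the answer.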
On Apr 20, 2012, at 8:54 PM, seme...@hotmail.com wrote:

So could I do:

cts:search(/,
    cts:and-query((
        cts:inexpensive-query...,
        cts:expensive-query...
    ))
)

and MarkLogic will check the first condition (cts:inexpensive-query) first and 
only check the second condition if the first is true?

CC: general@developer.marklogic.com
From: m...@blakeley.com
Date: Fri, 20 Apr 2012 19:49:17 -0700
To: general@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which
condition is faster to check first?

Yes, boolean ops will short-circuit. You can test this for yourself using 
xdmp:sleep and xdmp:elapsed-time.
-- Mike
On Apr 20, 2012, at 19:15, "seme...@hotmail.com" <seme...@hotmail.com> wrote:

I may have some queries where the comparison is expensive. So what I'd like to 
do is add an extra element in each doc which is a "shortcut" to check for first 
before doing the expensive comparison.
For example, suppose I had data that had random words ("boat", "alligator", 
"house") in an element called "words" and there was a random number of words in 
the element (say, 5 to 50). Other documents may have the same words but in a 
different order. I want to find the documents that have the same number of 
words and the same words regardless of the order.
So 
1: <words>boat alligator house bandit flower</words>
and 2: <words>bandit alligator house flower boat</words> would be a match
but 3: <words>bandit alligator house flower boat island</words> would not be a 
match because it has an extra word.
I thought that I can add a new element to each doc which represents the number 
of words (5, 6, 11, 20, etc) and I can first check that the doc has the right 
number of words before I check to see if it has the same words. I am thinking 
the extra check on the number of words would shortcut the query to not even 
bother checking individual words if the number of words doesn't match and save 
me some time. The time may add up if I have millions or tens of millions of 
docs to query against.
So if my thinking is correct, then I would have documents that look like this:
<doc>
    <words>bandit alligator bandit flower boat</words>
    <num-words>5</num-words>
</doc>
I could put a range index on the "num-words" element of type xs:int.
Then I'd like to write queries so that the num-words condition is checked first 
by the magical MarkLogic engine and only if that first condition is met would 
it check the rest.
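
As a rough picture of the per-document idea (a cheap precomputed count checked before a more expensive word comparison), here is a minimal plain-Java sketch. The class and method names are hypothetical, the word strings are made-up test data, and it says nothing about how MarkLogic would evaluate a cts:query over a range index; it only shows the count-then-compare logic on in-memory strings.

```java
import java.util.Arrays;

// Hypothetical in-memory stand-in for a document with <words> and <num-words>.
class WordsDocSketch {
    final String[] sortedWords; // words kept in sorted order, so source order is irrelevant
    final int numWords;         // precomputed "shortcut" value, like <num-words>

    WordsDocSketch(String words) {
        String[] tokens = words.trim().split("\\s+"); // split on runs of whitespace
        Arrays.sort(tokens);
        this.sortedWords = tokens;
        this.numWords = tokens.length;
    }

    // Cheap check first: if the counts differ, the word comparison is skipped.
    // (Arrays.equals also compares lengths; the explicit check just mirrors
    // the "shortcut" described in the text.)
    boolean sameWordsAs(WordsDocSketch other) {
        if (numWords != other.numWords) {
            return false;
        }
        return Arrays.equals(sortedWords, other.sortedWords);
    }

    public static void main(String[] args) {
        WordsDocSketch a = new WordsDocSketch("boat alligator house bandit flower");
        WordsDocSketch b = new WordsDocSketch("flower bandit house alligator boat");
        WordsDocSketch c = new WordsDocSketch("flower bandit house alligator boat island");
        System.out.println(a.sameWordsAs(b)); // true
        System.out.println(a.sameWordsAs(c)); // false
    }
}
```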
I know in Java that the runtime environment won't check the second condition if 
the first is false in a boolean statement. So:
if (1 == 0 && explode()) { ....
"explode()" will never be called because the first condition in the statement 
is false. But the order is important; "1 == 0" must be before "explode()" in 
the statement because that statement will be evaluated from left to right.
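
The Java behaviour described here can be checked with a small self-contained sketch. The class name and the call counter are hypothetical additions for illustration; explode() is a harmless stand-in for the expensive check.

```java
// Hypothetical demo of Java's short-circuiting && operator.
class ShortCircuitSketch {
    static int explodeCalls = 0;

    // Stand-in for the expensive check; counts how often it actually runs.
    static boolean explode() {
        explodeCalls++;
        return true;
    }

    // Cheap comparison on the left: explode() runs only if left == right.
    static boolean cheapFirst(int left, int right) {
        return left == right && explode();
    }

    // Expensive call on the left: explode() runs every time.
    static boolean expensiveFirst(int left, int right) {
        return explode() && left == right;
    }

    public static void main(String[] args) {
        System.out.println(cheapFirst(1, 0));     // false
        System.out.println(explodeCalls);         // 0
        System.out.println(expensiveFirst(1, 0)); // false
        System.out.println(explodeCalls);         // 1
    }
}
```

Both methods return the same value for the same arguments; only the number of explode() calls differs, which is why operand order matters in Java.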
I don't know if XQuery or MarkLogic works that way (didn't see anything in the 
spec) and I know that MarkLogic has all sorts of optimizations, but how will it 
know that it's faster to check the "num-words" condition before the individual 
words? Can I write a cts:query that gives a hint to MarkLogic to give 
precedence to one condition over another to save time? *I* know that the 
num-words check is faster but how can *MarkLogic* know that? 
I suppose it could be argued that it doesn't really matter because MarkLogic 
runs fast anyway, but I'm talking about long running queries over massive data 
sets so even small amounts of time are important to me.
Thanks!
-Ryan
_______________________________________________
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general
