I may have some queries where the comparison is expensive. What I'd like to do
is add an extra element to each doc that acts as a "shortcut" to check first,
before doing the expensive comparison.
For example, suppose I had data with random words ("boat", "alligator",
"house") in an element called "words", and there was a random number of words
in the element (say, 5 to 50). Other documents may have the same words but in a
different order. I want to find the documents that have the same number of
words and the same words, regardless of the order.
So
1: <words>boat alligator house bandit flower</words>
and 2: <words>bandit alligator bandit flower boat</words> would be a match,
but 3: <words>bandit alligator bandit flower boat island</words> would not be a
match because it has an extra word.
I thought I could add a new element to each doc that holds the number of words
(5, 6, 11, 20, etc.), so that I can first check that the doc has the right
number of words before I check whether it has the same words. My thinking is
that the extra check on the number of words would shortcut the query: if the
number of words doesn't match, it wouldn't even bother checking the individual
words, which would save me some time. That time may add up if I have millions
or tens of millions of docs to query against.
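
To make the idea concrete, here is a minimal in-memory sketch in Java of "cheap
count check first, expensive word comparison second". It only illustrates the
logic and says nothing about how MarkLogic evaluates a query. The class and
method names (WordShortcut, Doc, sameWords, matches) are made up for this
example, the sample word lists are illustrative, and it assumes that "same
words" means the same words with the same number of repeats.

```java
import java.util.Arrays;

class WordShortcut {
    // Hypothetical stand-in for a stored document: the <words> text plus
    // the precomputed <num-words> value.
    static final class Doc {
        final String words;
        final int numWords;

        Doc(String words) {
            this.words = words;
            this.numWords = split(words).length;
        }
    }

    // Split on whitespace; an empty string yields zero words.
    static String[] split(String words) {
        String trimmed = words.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // The expensive check: same words with the same repeats, in any order.
    // Assumption: duplicates count, so this compares sorted word lists.
    static boolean sameWords(String a, String b) {
        String[] x = split(a);
        String[] y = split(b);
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    // The cheap count comparison comes first; && skips sameWords()
    // entirely when the counts differ.
    static boolean matches(Doc a, Doc b) {
        return a.numWords == b.numWords && sameWords(a.words, b.words);
    }

    public static void main(String[] args) {
        Doc a = new Doc("boat alligator house bandit flower");
        Doc b = new Doc("flower bandit house alligator boat");
        Doc c = new Doc("flower bandit house alligator boat island");
        System.out.println(matches(a, b)); // true: same five words, reordered
        System.out.println(matches(a, c)); // false: word counts differ (5 vs 6)
    }
}
```

In this sketch the count check is only a filter: two docs with equal counts
still go through the full word comparison.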
So if my thinking is correct, then I would have documents that look like this:
<doc>
    <words>bandit alligator bandit flower boat</words>
    <num-words>5</num-words>
</doc>
I could put a range index on the "num-words" element of type xs:int.
Then I'd like to write queries so that the num-words condition is checked first 
by the magical MarkLogic engine and only if that first condition is met would 
it check the rest.
I know that in Java the runtime won't evaluate the second operand of && if the
first one is false. So:
if (1 == 0 && explode()) { ....
"explode()" will never be called because the first condition in the statement 
is false. But the order is important; "1 == 0" must be before "explode()" in 
the statement because that statement will be evaluated from left to right.
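
The Java behaviour described above can be checked with a small, self-contained
sketch. The explode() here is a hypothetical stand-in that throws if it is ever
called, and the class name ShortCircuit and the calls counter are made up for
the example. It demonstrates Java's && only and makes no claim about XQuery or
MarkLogic.

```java
class ShortCircuit {
    // Counts how often explode() is entered.
    static int calls = 0;

    // Hypothetical stand-in for the explode() in the post: it always throws,
    // so any call to it is impossible to miss.
    static boolean explode() {
        calls++;
        throw new IllegalStateException("explode() was called");
    }

    // && evaluates left to right and skips the right operand
    // when the left operand is false.
    static boolean check(int a, int b) {
        return a == b && explode();
    }

    public static void main(String[] args) {
        System.out.println(check(1, 0)); // false, explode() is never reached
        System.out.println(calls);       // 0
        try {
            check(1, 1);                 // left side is true, so explode() runs
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```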
I don't know if XQuery or MarkLogic works that way (didn't see anything in the 
spec) and I know that MarkLogic has all sorts of optimizations, but how will it 
know that it's faster to check the "num-words" condition before the individual 
words? Can I write a cts:query that gives a hint to MarkLogic to give 
precedence to one condition over another to save time? *I* know that the 
num-words check is faster but how can *MarkLogic* know that? 
I suppose it could be argued that it doesn't really matter because MarkLogic
runs fast anyway, but I'm talking about long-running queries over massive data
sets, so even small amounts of time are important to me.
Thanks!
-Ryan                                     