I may have some queries where the comparison is expensive. So what I'd like to
do is add an extra element in each doc which is a "shortcut" to check for first
before doing the expensive comparison.
For example, suppose I had data with random words ("boat", "alligator",
"house") in an element called "words", with a random number of words in
each element (say, 5 to 50). Other documents may have the same words but in a
different order. I want to find the documents that have the same number of
words and the same words, regardless of order.
So
1: <words>boat alligator house bandit flower</words>
and 2: <words>bandit alligator house flower boat</words> would be a match,
but 3: <words>bandit alligator house flower boat island</words> would not be a
match because it has an extra word.
I thought I could add a new element to each doc that holds the number
of words (5, 6, 11, 20, etc.) and first check that the doc has the right
number of words before checking whether it has the same words. My thinking is
that the extra check on the number of words would short-circuit the query, so it
would not even bother checking individual words when the counts don't match, and
that would save me some time. The time may add up if I have millions or tens of
millions of docs to query against.
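
To make the idea concrete, here is a rough Java sketch of the "count first, then compare" logic as I picture it outside the database. The WordShortcut class, the Doc holder, and the sameWords method are names I made up for illustration; this says nothing about how MarkLogic would actually evaluate a query.

```java
import java.util.Arrays;

class WordShortcut {
    // Hypothetical in-memory stand-in for a document: the raw "words" text
    // plus the precomputed "num-words" value.
    static final class Doc {
        final String words;
        final int numWords;

        Doc(String words) {
            this.words = words;
            this.numWords = split(words).length;
        }
    }

    // Split on whitespace; an empty or blank string yields zero words.
    static String[] split(String words) {
        String trimmed = words.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // True when both docs hold the same words (duplicates included), in any order.
    static boolean sameWords(Doc a, Doc b) {
        // Cheap check first: different counts can never match.
        if (a.numWords != b.numWords) {
            return false;
        }
        // Expensive check: sort both word lists and compare them.
        String[] x = split(a.words);
        String[] y = split(b.words);
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    public static void main(String[] args) {
        Doc a = new Doc("boat alligator house bandit flower");
        Doc b = new Doc("flower bandit house alligator boat");
        Doc c = new Doc("flower bandit house alligator boat island");
        System.out.println(sameWords(a, b)); // true
        System.out.println(sameWords(a, c)); // false, rejected by the count check
    }
}
```

The count comparison is a single integer test, while the word comparison has to split and sort both lists, which is why I would want the count checked first.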
So if my thinking is correct, then I would have documents that look like this:
<doc>
  <words>bandit alligator bandit flower boat</words>
  <num-words>5</num-words>
</doc>
I could put a range index on the "num-words" element of type xs:int.
Then I'd like to write queries so that the num-words condition is checked first
by the magical MarkLogic engine and only if that first condition is met would
it check the rest.
I know that in Java the runtime won't evaluate the second condition of an &&
expression if the first is false. So:
if (1 == 0 && explode()) { ....
"explode()" will never be called because the first condition in the statement
is false. But the order is important; "1 == 0" must be before "explode()" in
the statement because that statement will be evaluated from left to right.
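
A small Java sketch of that left-to-right behavior, in case it helps show what I mean. The ShortCircuitDemo class and its expensive() method are hypothetical names that stand in for the costly comparison; the counter just records whether it ran.

```java
class ShortCircuitDemo {
    // Counts how many times the expensive check actually ran.
    static int expensiveCalls = 0;

    // Hypothetical stand-in for the costly comparison.
    static boolean expensive() {
        expensiveCalls++;
        return true;
    }

    // && evaluates left to right and skips expensive() when cheap is false.
    static boolean cheapFirst(boolean cheap) {
        return cheap && expensive();
    }

    public static void main(String[] args) {
        cheapFirst(1 == 0);
        System.out.println(expensiveCalls); // 0, expensive() was skipped
        cheapFirst(1 == 1);
        System.out.println(expensiveCalls); // 1, expensive() ran once
    }
}
```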
I don't know if XQuery or MarkLogic work that way (I didn't see anything in the
spec). I know that MarkLogic has all sorts of optimizations, but how will it
know that it's faster to check the "num-words" condition before the individual
words? Can I write a cts:query that gives a hint to MarkLogic to give
precedence to one condition over another to save time? *I* know that the
num-words check is faster but how can *MarkLogic* know that?
I suppose it could be argued that it doesn't really matter because MarkLogic
runs fast anyway, but I'm talking about long running queries over massive data
sets so even small amounts of time are important to me.
Thanks!
-Ryan