In Query Console, I compared the output of this query:

    exists(collection()),
    xdmp:query-meters()

with this query:

    xdmp:estimate(collection()),
    xdmp:query-meters()

The latter is consistently much faster than the former, according to the output
of query meters. It looks like the tree data must be accessed in the first case
(even if only for one fragment), but not at all in the second. And if that
fragment isn't already in the expanded tree cache, the query runs that much slower.
I don't know if this will completely solve your problem, but I think Geert's
right: start by replacing "exists(" with "xdmp:estimate(" (and let 0 convert to
false).
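
Applied to the loop in David's original message (quoted below), the replacement
might look something like this. It is an untested sketch: it assumes each
summary document's URI doubles as the name of its data collection, as David
describes, and the document-uri(.) step is my addition so that $url is a URI
string rather than a document node.

```xquery
xquery version "1.0-ml";

(: Sketch only: assumes every document in "/summaries" has a URI that is
   also the name of the collection holding its loaded log documents. :)
for $url in collection("/summaries")/document-uri(.)
return
  (: xdmp:estimate returns a fragment count resolved from the indexes;
     0 means nothing has been loaded for this summary. :)
  if (xdmp:estimate(collection($url)) gt 0) then
    xdmp:collection-delete($url)
  else ()
```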
Evan Lenz
Software Developer, Community
MarkLogic Corporation
http://developer.marklogic.com

From: Geert Josten <[email protected]>
Reply-To: General MarkLogic Developer Discussion <[email protected]>
Date: Thu, 17 Nov 2011 11:53:50 -0800
To: General MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search

Hi Lee,
Actually, the exists() might be the slowest part of your code. The
collection call is likely backed by an index, so it is quick. The exists()
works on a sequence, however. It could be that it is optimized under the hood
to use xdmp:estimate in this case, but I'm not sure. You could try rewriting
that. But actually, I wouldn't test at all.
A collection-delete of an empty collection won’t take time I’d say. So wouldn’t
worry about that too much.
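
Dropping the check altogether would then come down to something like the
following. This is only a sketch under the assumption that the summary URIs are
the collection names; the document-uri(.) step is added here to get those URIs.

```xquery
xquery version "1.0-ml";

(: Sketch only: deletes every collection named after a summary document,
   without first checking whether anything was loaded into it. :)
for $url in collection("/summaries")/document-uri(.)
return xdmp:collection-delete($url)
```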
What remains is the initial collection, which returns a sequence. If you are
not collecting the results, MarkLogic doesn’t need to keep it in memory. Could
very well be that it is streamed in the outer for loop. Otherwise try chunking
it in batches of 10k. Remember that deletes in ML are fast! It's just a flag on
each fragment.
Kind regards,
Geert

From: [email protected] [mailto:[email protected]]
On Behalf Of Lee, David
Sent: Thursday, 17 November 2011 20:41
To: General Mark Logic Developer Discussion ([email protected])
Subject: [MarkLogic Dev General] "Joins" in search:search or cts:search

I suspect the answer is "no" ... but just picking the brains out there ...
For good or bad, I use this archetype.
I have many "summary" documents, say "/logs/1.xml" and "/logs/2.xml", which
belong to the collection "/summaries".
There can be many (100k+).
Each summary document lists references to external URLs (in this case Amazon
S3) from which data could be loaded.
If I load the data, I put each group into a collection named by the URL of the
summary.
So say I have 10,000 XML documents referenced by doc("/logs/1.xml"). If I
choose to load them, they will end up in the collection
"/logs/1.xml". The summaries themselves are in the collection "/summaries".
The reason for this is for the ability to easily bulk delete blocks of
documents based on their summaries.
I can list the summaries and, by a simple

    exists(collection($url))

can tell if any actual log documents have been loaded.
NOW: I want to be able to delete all records by summary, but only if the
documents have been loaded.
Suppose I had 100k summary URLs. I could do

    (: document-uri(.) yields the summary's URI, which names its collection :)
    for $url in collection("/summaries")/document-uri(.)
    return
      if (exists(collection($url))) then
        xdmp:collection-delete($url)
      else ()

This works and all ... but suppose I want something more efficient.
Overall there may be only, say, 1% of the summary documents actually loaded.
Furthermore, if there were LOTS of ones loaded, the above would time out.
So I spawn a thread to delete, say, [1 to 10] of every summary collection ...
but say I have 100k collections: most of the threads do nothing.
So I have to revert to the above to first check if the collection has anything
before spawning a thread.
Question: Is there a cts:search option which can do a collection query based
on the results of the search itself?
That is (pseudo code), in one cts:search:

    for $c in collection("x")/document-uri(.)
    where exists(collection($c))
    return $c

Doing this in FLWOR is very slow ...
but it's what I'm resorting to ...
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224