This is getting seriously interesting.
I'll try a few of these and report back to the group with my results on my own 
dataset.


----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

From: [email protected] On Behalf Of Jason Hunter
Sent: Monday, November 21, 2011 12:05 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] "Joins" in search: search or cts:search 
(Damon Feldman)

The "uri lexicon" and "collection lexicon" database settings can be thought of 
as enabling these range indexes:

 * type=anyURI, namespace=http://marklogic.com/xdmp, localname=document

 * type=anyURI, namespace=http://marklogic.com/xdmp, localname=collection

Example calls:

(: This is like cts:uris() :)
cts:element-values(xs:QName("xdmp:document"))[1 to 10]

(: There's no other way to express this, cool eh :)
cts:element-value-co-occurrences(
  xs:QName("xdmp:document"),
  xs:QName("xdmp:collection")
)[1 to 10]

If you got a complaint, Evan, it's probably because you didn't have the two 
lexicons enabled.

-jh-

On Nov 19, 2011, at 10:43 AM, Evan Lenz wrote:


That sounds promising, but I don't think it would help much here, since the aim 
is to find, given a set of document URIs, all the string-equal collection URIs 
that exist (but in practice apply to different documents, i.e. do not co-occur).

However, I'm still wondering: how do you express a co-occurrence call on 
document and collection URIs? I tried using the QNames "xdmp:document" and 
"xdmp:collection" with cts:element-value-co-occurrences() but the server 
complained. Is there a more up-to-date way of doing this?

Evan

From: Kelly Stirman <[email protected]>
Reply-To: General MarkLogic Developer Discussion <[email protected]>
Date: Sat, 19 Nov 2011 08:52:52 -0800
To: [email protected]
Subject: Re: [MarkLogic Dev General] "Joins" in search: search or cts:search 
(Damon Feldman)

You could also do co-occurrence of the document and collection uris.
________________________________
From: [email protected]
Sent: 11/19/2011 8:16 AM
To: [email protected]
Subject: General Digest, Vol 89, Issue 80

Message: 1
Date: Sat, 19 Nov 2011 08:15:50 -0800
From: Damon Feldman <[email protected]>
Subject: Re: [MarkLogic Dev General] "Joins" in search: search or cts:search
To: General MarkLogic Developer Discussion <[email protected]>
Message-ID: <d20c296d14127d4ebd176ad949d8a75a0600ce4...@exchg-be.marklogic.com>
Content-Type: text/plain; charset="us-ascii"

Great solution.

My guess on performance is that it will be very good (and functional). To do 
cts:collections($uri, "limit=1") many times will (I assume) have to do a bunch 
of binary searches through the URI lexicon for each URI in /summaries, which 
may be slower than a single hash-based intersection, but then again it avoids 
pulling back the entire collection lexicon, and avoids the procedural flavor of 
map:put().

I'd be interested to know how it performs in reality. That kind of thing is 
where performance tuning becomes interesting.
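
For comparison, the map-based "hash intersection" alternative mentioned above 
might look roughly like the sketch below. It is only an illustration, not the 
exact code posted earlier in the thread: it assumes the URI lexicon and the 
collection lexicon are both enabled, and the variable name $summary-uris is 
made up for the example.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes the URI and collection lexicons are enabled. :)
(: $summary-uris is a hypothetical name used for illustration.       :)
let $summary-uris := map:map()
let $_ :=
  (: Load every document URI in "/summaries" into the map as a key. :)
  for $uri in cts:uris("", (), cts:collection-query("/summaries"))
  return map:put($summary-uris, $uri, true())
return
  (: Keep only the collection URIs that are also summary document URIs. :)
  cts:collections()[exists(map:get($summary-uris, .))]
```

Note that this still pulls back the whole collection lexicon, which is the 
cost the cts:collections($uri, "limit=1") approach avoids.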

Damon

From: [email protected] On Behalf Of Evan Lenz
Sent: Saturday, November 19, 2011 1:51 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] "Joins" in search: search or cts:search

Glad you found a working solution!

But I hope you don't mind if I still try to rescue that frozen dog. :-)  
Damon's solution using maps is probably the best, but I've been wondering if 
there was a purely functional yet still performant way to do it. Then this 
popped into my head while brushing my teeth tonight:

for $uri in cts:uris("",(),cts:collection-query("/summaries"))
return cts:collections($uri,"limit=1")[. eq $uri]

Damon, does that fit the bill?

Evan

From: "Lee, David" <[email protected]>
Reply-To: General MarkLogic Developer Discussion <[email protected]>
Date: Fri, 18 Nov 2011 07:21:53 -0800
To: General MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] "Joins" in search: search or cts:search

Thanks!
While it's the most elegant solution posted so far :), it's also slower than 
a dog with his feet frozen in mud.
On my machine it took about 3 minutes to return 200 URLs.


It's OK though; I found a different way that's faster. I'll save the details 
because the problem is actually more complex than the original question (and so 
the solution is different as well).
But the core is that I ended up using collection-match() ... it turns out the 
collections in question have a common naming convention, so I didn't use a join 
after all ...

But thanks all for the interesting ideas!




From: [email protected] On Behalf Of Evan Lenz
Sent: Thursday, November 17, 2011 6:44 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search

I think I've got it. You want to join between two lexicons. You can limit your 
collection URIs to being those that are the same as doc URIs in another 
collection (like "/summaries" in your case). Enable the URI lexicon and join 
between it and the collection lexicon:

cts:collections()[. = cts:uris("",(),cts:collection-query("/summaries"))]

Evan

From: Evan Lenz <[email protected]>
Reply-To: General MarkLogic Developer Discussion <[email protected]>
Date: Thu, 17 Nov 2011 15:22:29 -0800
To: General MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search

Just brainstorming a bit, but if you enable the collection lexicon, then you 
could query for the list of existing collections using cts:collections(), or 
better yet, using cts:collection-match():

for $uri in cts:collection-match("/logs/*")
return xdmp:estimate(collection($uri))

I believe that "0" should never appear in the resulting list, because otherwise 
the collection wouldn't exist. (A collection only exists by virtue of a 
document being associated with it.) cts:collection-match("/logs/*") returns 
all the collection URIs matching that pattern, and if I'm right that there's 
no such thing as an empty collection, you won't ever need to check whether one 
is empty. So it seems like you could confidently spawn collection deletes on 
all the existing "/logs/*" collections that way.
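
As a rough sketch of that last step, something like the following might work. 
It assumes the collection lexicon is enabled, and the module path 
"/delete-collection.xqy" and its external variable $uri are hypothetical names 
invented for this example.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes the collection lexicon is enabled.            :)
(: "/delete-collection.xqy" is a hypothetical module that would hold: :)
(:   declare variable $uri as xs:string external;                     :)
(:   xdmp:collection-delete($uri)                                     :)
for $uri in cts:collection-match("/logs/*")
return
  (: Queue one delete task per collection that actually exists. :)
  xdmp:spawn("/delete-collection.xqy", (xs:QName("uri"), $uri))
```

Each delete then runs as its own task, so no single request has to delete 
everything within one timeout.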

Evan

From: "Lee, David" <[email protected]>
Reply-To: General MarkLogic Developer Discussion <[email protected]>
Date: Thu, 17 Nov 2011 11:41:15 -0800
To: General MarkLogic Developer Discussion <[email protected]>
Subject: [MarkLogic Dev General] "Joins" in search:search or cts:search

I suspect the answer is "no" ... but just plugging the brains out there ..

For good or bad I use this archetype.

I have many "summary" documents, say "/logs/1.xml" and "/logs/2.xml", which 
belong to the collection "/summaries".

There can be many (100k+).

Each summary document lists references to external URLs (in this case Amazon 
S3) from which data could be loaded.
If I load the data, I put each group into a collection named by the URL of the 
summary.
So say I have 10,000 XML documents referenced by doc("/logs/1.xml"). If I 
choose to load them, they will end up in the collection "/logs/1.xml". The 
summaries themselves are in the collection "/summaries".

The reason for this is the ability to easily bulk delete blocks of documents 
based on their summaries.
I can list the summaries and, by a simple
                exists( collection( $url) )

can tell if any actual log documents have been loaded.


NOW:  I want to be able to delete all records by summary, but only if the 
documents have been loaded.
Suppose I had 100k summary URLs. I could do

                (: iterate over the URIs of the summary documents :)
                for $url in collection("/summaries")/document-uri(.)
                return
                    if (exists(collection($url))) then
                        xdmp:collection-delete($url)
                    else ()


This works and all ... but suppose I want something more efficient.
Overall there may be only, say, 1% of the summary documents actually loaded.
Furthermore, if there were LOTS of them loaded, the above would time out.

So I spawn a thread to delete, say, [1 to 10] of every summary collection ...
but if I have 100k collections, most of the threads do nothing.
So I have to revert to the above to first check if the collection has anything 
before spawning a thread.

Question:   Is there a cts:search option which can do a collection query based 
on the results of the search itself?
That is (pseudo code), in one cts:search:

    for $c in collection("x")/document-uri(.)
    where exists(collection($c))
    return $c

Doing this in FLWOR is very slow ...
but it's what I'm resorting to ....













_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general

