Re: [MarkLogic Dev General] Lots of collections ...

Lee, David Wed, 20 Jul 2011 09:23:24 -0700

Great Questions !

1) do I need range indexes on the attributes ?
Maybe.  My original need was to efficiently produce the set of unique values 
of, say, log/@host.
Without the range index this took quite a while to execute:
fn:distinct-values($docs/log/@host/string())
With a range index I was able to get the list almost instantly.
As my prototype evolves I may not actually need this ...
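
For reference, a minimal sketch of the kind of lexicon call that reads the 
values straight from a range index. It assumes a string attribute range index 
is configured on log/@host with no namespace and the default collation; it is 
not necessarily the exact query used here.

```xquery
xquery version "1.0-ml";

(: Assumption: an attribute range index of type string exists on log/@host.
   Without that index this call raises an error instead of falling back
   to scanning documents. :)
cts:element-attribute-values(xs:QName("log"), xs:QName("host"))
```

If the index was created with a non-default collation, a matching 
"collation=..." option would need to be passed in the options argument.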



2) Why use collections ?
A) Trying to learn how they work
B) I've discovered (with lots of hints) that collections can participate in 
very fast document deletes.  This is crucial to my app.
   I was able to delete 500,000 docs in a collection in just a few seconds.  
When they were in a directory, or deleted one by one via their URIs, it took 
minutes (sometimes hours).
C) I plan on doing a lot of searching that is primarily filtered by these few 
attributes, with more criteria added at runtime.  Collections seem like an 
interesting approach (see #A), but maybe they are pointless!
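
A minimal sketch of the collection delete described in (B). The collection 
name "host-host1" is borrowed from the example later in this thread and is 
only illustrative.

```xquery
xquery version "1.0-ml";

(: Deletes every document that belongs to the named collection.
   "host-host1" is an illustrative collection name, not a required one. :)
xdmp:collection-delete("host-host1")
```

The one-by-one alternative mentioned above would be a loop of 
xdmp:document-delete calls, one per URI.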


3) When do I assign the collection?
Great question.
As it happens, at data-load time I already have the values of these particular 
attributes in memory, so when I run my put command 
(http://www.xmlsh.org/MarkLogicPut) it is trivial to append a list of 
collections for the set of documents I am uploading.
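
The put command above is an xmlsh client-side tool; the sketch below shows 
roughly what the same idea looks like in server-side XQuery, deriving the 
collection names from the attributes at insert time. The URI and the sample 
document are hypothetical.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and document, for illustration only. :)
let $uri := "/logs/example-0001.xml"
let $doc := <logfile host="host1" system="tomcat"/>
return
  xdmp:document-insert(
    $uri,
    $doc,
    xdmp:default-permissions(),
    (: One collection per attribute/value pair: host-host1, system-tomcat :)
    (fn:concat("host-", $doc/@host),
     fn:concat("system-", $doc/@system)))
```

For documents that are already loaded, xdmp:document-add-collections($uri, 
$collections) can attach collections after the fact.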

Also I was considering triggers ... which may be vastly overkill or expensive, 
but might be worth playing with if only to learn them (see #A)





----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]<mailto:[email protected]>
812-482-5224

From: [email protected] 
[mailto:[email protected]] On Behalf Of Evan Lenz
Sent: Wednesday, July 20, 2011 11:53 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Lots of collections ...

Interesting approach. I have a few questions. First of all, do you even need 
either (range indexes or collections)? The query example you gave should be 
resolvable from the Universal Index alone. I tried this in CQ (after creating 
300 sample <logfile> documents):

xdmp:query-trace(true()),
//logfile[@host eq 'host1']

And then I looked in the server log:

Analyzing path: fn:collection()/descendant::logfile[@host eq "host1"]
Step 1 is searchable: fn:collection()
Step 2 is searchable: descendant::logfile[@host eq "host1"]
Path is fully searchable.
Gathering constraints.
Comparison contributed hash value constraint: logfile/@host = "host1"
Step 2 predicate 1 contributed 1 constraint: @host eq "host1"
Comparison contributed hash value constraint: logfile/@host = "host1"
Step 2 predicate 1 contributed 1 constraint: @host eq "host1"
Step 2 contributed 2 constraints: descendant::logfile[@host eq "host1"]
Executing search.
Selected 100 fragments to filter

The above told me that the result was completely resolved from the Universal 
Index, since I haven't enabled any range indexes (and I know that exactly 100 
of my sample docs have host="host1").
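
Another way to check what the indexes alone can resolve, sketched here as an 
assumption rather than as part of the test above, is to express the same 
constraint as a cts query and ask for an index-only estimate:

```xquery
xquery version "1.0-ml";

(: xdmp:estimate counts matching fragments from the indexes
   without filtering the documents themselves. :)
xdmp:estimate(
  cts:search(
    fn:collection(),
    cts:element-attribute-value-query(
      xs:QName("logfile"), xs:QName("host"), "host1")))
```

If the estimate agrees with the number of documents known to match, the 
constraint is being resolved from the indexes.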

My other two questions:
* What is your main motivation for using collections rather than attribute 
  range indexes?
* How do you plan to associate the documents with the collection URIs?

Thanks,

Evan Lenz
Software Developer, Community
developer.marklogic.com<http://developer.marklogic.com>
From: "Lee, David" <[email protected]<mailto:[email protected]>>
Reply-To: General MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Tue, 19 Jul 2011 14:34:39 -0700
To: "General Mark Logic Developer Discussion 
([email protected]<mailto:[email protected]>)" 
<[email protected]<mailto:[email protected]>>
Subject: [MarkLogic Dev General] Lots of collections ...

Thanks to some tips from this group (and especially Kelly !)
I've started leveraging collections instead of directories.  So far really 
fantastic results !!!
Thank you all !!

Of course one success opens the doors to a million questions ...

Question ... Is there a significant cost to having a large number of 
documents in overlapping collections?
In my use case I may have millions of very similar small documents, all with 
the same basic set of attributes, each of which has a small set of possible 
values.  I've implemented attribute value range indexes, but was wondering 
if collections might work better.
A typical use case would be to filter a result set down to only those 
documents with a particular attribute set to one value.
If I had a collection for each attribute/value combination (maybe 100 
collections max), a collection query could do the equivalent of a range index.
Example:

<logfile host="host1" system="tomcat" ...>
   ...

Instead of making a range index on logfile/@host and logfile/@system, I would 
make collections called host-host1, host-host2, host-host3, ... and 
system-tomcat, system-mysql, ...
Then this XPath
//logfile[@host eq 'host1']

would be equivalent to a collection search on 'host-host1'
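
As a rough sketch of that equivalence, assuming the documents have already 
been assigned to the collections named above:

```xquery
xquery version "1.0-ml";

(: Single filter: logfile elements from documents in host-host1. :)
fn:collection("host-host1")/logfile,

(: Two filters combined: documents in both host-host1 and system-tomcat. :)
cts:search(
  fn:collection()/logfile,
  cts:and-query((
    cts:collection-query("host-host1"),
    cts:collection-query("system-tomcat"))))
```

Further runtime criteria could be added as more terms inside the 
cts:and-query.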

Is this brilliant or stupid ?  Obviously there will be a tradeoff ... but I'm 
thinking in this case since the number of possible values is very small that 
collections might actually be a good thing.

-David





----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]<mailto:[email protected]>
812-482-5224

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
