I need to design a data element for our platform with an eye to the most 
efficient possible retrieval of documents in a collection defined by 
this data element.  Assume there could be millions of documents.  It 
will have at least three dimensions: site, content-set, and status; 
these are all completely independent.  None of these is likely to have 
more than a few tens or hundreds of distinct values; status will have 2 
or 3, definitely fewer than 10.

I need to be able to retrieve documents based on the value of each 
dimension independently (i.e., all documents in content set X), as well 
as (and this could be more typical) by a fully specified vector 
(content-set, site, and status).

I can think of several possibilities:

1. An element whose text includes all three values as words in some 
predefined order:

<collection>cs100 site50 status1</collection>

with word queries for single-dimension queries and value queries (or 
maybe phrase queries?) for joins.

2. A MarkLogic collection whose name is all three values concatenated 
in some order:

collection("cs100-site50-status1")

Joins of all three dimensions become a simple collection lookup, and 
single- or dual-dimension queries would use cts:collection-match().

3. An element with three attributes:

<collection cs="100" site="50" status="1" />

This is attractive from the perspective of XML modeling and will expose 
the values neatly for XPath (perhaps we could combine it with one of the 
above), but I'm concerned that cts:element-query(collection, ...) might 
not be as efficient for retrieval.  Also, would we need to enable 
element-position indexes to make this accurate as an unfiltered query?
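
For concreteness, here is a rough, untested sketch of how the queries 
might look under each option.  It assumes the sample values above, at 
most one collection element per document, and (for the 
cts:collection-match() call) that the collection lexicon is enabled; 
the variable names are only for illustration.

```xquery
xquery version "1.0-ml";

(: Option 1: all three values as words in one element.
   A value query matches the element's whole text, so the
   word order has to be fixed when the document is written. :)
let $option-1 :=
  cts:element-value-query(xs:QName("collection"), "cs100 site50 status1")

(: Option 1, single dimension: one word anywhere in the element. :)
let $option-1-single :=
  cts:element-word-query(xs:QName("collection"), "cs100")

(: Option 2: one collection named by concatenating the values. :)
let $option-2 := cts:collection-query("cs100-site50-status1")

(: Option 2, single dimension: expand a wildcard against the
   collection lexicon, then query the collections it returns. :)
let $option-2-single :=
  cts:collection-query(cts:collection-match("cs100-*"))

(: Option 3: one element carrying three attributes. :)
let $option-3 :=
  cts:and-query((
    cts:element-attribute-value-query(
      xs:QName("collection"), xs:QName("cs"), "100"),
    cts:element-attribute-value-query(
      xs:QName("collection"), xs:QName("site"), "50"),
    cts:element-attribute-value-query(
      xs:QName("collection"), xs:QName("status"), "1")
  ))

(: Run one of the queries without the filtering pass. :)
return cts:search(fn:doc(), $option-2, "unfiltered")
```

Swapping $option-2 for any of the other queries in the final 
cts:search() call should make it possible to compare them on the same 
data.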

Would anyone care to comment on the "best" design?  Other ideas?

Thanks!

-- 
Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston

PubFactory: the revolutionary e-publishing platform from iFactory
