I need to design a data element for our platform with an eye to the most
efficient possible retrieval of documents in a collection defined by
this data element. Assume there could be millions of documents. It
will have at least three dimensions: site, content-set, and status;
these are all completely independent. None of these are likely to have
more than a few tens or hundreds of different values: status will have 2
or 3, definitely fewer than 10.
I need to be able to retrieve documents based on the value of each
dimension independently (i.e., all documents in content set X), as well
as (and this could be more typical) by a fully specified vector
(content-set, site, and status).
I can think of several possibilities:
1. An element whose text includes all three values as words in some
predefined order:
<collection>cs100 site50 status1</collection>
with word queries for single-dimension queries and value queries (or
maybe phrase queries?) for joins.
2. A MarkLogic collection whose name is all three values concatenated in
some order:
collection("cs100-site50-status1")
Joins of all three dimensions become a simple collection lookup, with
cts:collection-match() for single- or dual-dimension queries.
3. An element with three attributes:
<collection cs="100" site="50" status="1" />
This is attractive from the perspective of XML modeling and will expose
the values neatly for XPath (perhaps we could combine it with one of the
above), but I'm concerned that
cts:element-query(collection, ...) might not be as efficient for
retrieval. Also, would we need to enable element-position indexes to make
this accurate as an unfiltered query?
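To make the comparison concrete, here is a rough, untested sketch of the queries I have in mind for each option. The element, attribute, and collection names are just the ones from the examples above; the variable names are made up for illustration. I'm assuming the collection lexicon is enabled for cts:collection-match(), and that each document carries exactly one <collection> element.

```xquery
xquery version "1.0-ml";

(: Option 1: all three values as words inside one element :)
let $opt1-single :=
  cts:element-word-query(xs:QName("collection"), "cs100")
let $opt1-full :=
  cts:element-value-query(xs:QName("collection"), "cs100 site50 status1")

(: Option 2: one MarkLogic collection per fully specified vector :)
let $opt2-full := cts:collection-query("cs100-site50-status1")
(: single dimension: expand the wildcard to the matching collection names :)
let $opt2-single :=
  cts:collection-query(cts:collection-match("cs100-*"))

(: Option 3: one element with three attributes :)
let $opt3-single :=
  cts:element-attribute-value-query(
    xs:QName("collection"), xs:QName("cs"), "100")
let $opt3-full :=
  cts:and-query((
    cts:element-attribute-value-query(
      xs:QName("collection"), xs:QName("cs"), "100"),
    cts:element-attribute-value-query(
      xs:QName("collection"), xs:QName("site"), "50"),
    cts:element-attribute-value-query(
      xs:QName("collection"), xs:QName("status"), "1")
  ))

(: swap in any of the queries above to compare them :)
return cts:search(fn:doc(), $opt3-full, "unfiltered")
```

In this sketch option 3 uses attribute value queries rather than cts:element-query(), which may sidestep the positions question if there really is only one <collection> element per document, but I have not verified that.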
Would anyone care to comment on the "best" design? Other ideas?
Thanks!
--
Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston
PubFactory: the revolutionary e-publishing platform from iFactory