Mike,

Items 2 and 3 are both great approaches, IMO. Item 1 seems less clean, and I suggest keeping discrete system values in attributes instead.
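As a rough illustration of the attribute approach (item 3 in the quoted message below), here is a minimal sketch of a load-and-query test. The document URIs, the <doc> wrapper element, the sample size, and the value ranges are all made up for the example; only the <collection> element and its cs, site, and status attributes come from the original question. It assumes the query constructor cts:element-attribute-value-query() and relies on the semicolon separator to run the query as a separate transaction, so that it sees the inserted documents.

```xquery
xquery version "1.0-ml";

(: Statement 1: load a sample of small random documents.
   URIs, the <doc> wrapper, and the value ranges are hypothetical. :)
for $i in 1 to 10000
let $cs     := xdmp:random(9) + 1   (: content-set, small made-up range :)
let $site   := xdmp:random(4) + 1   (: site, small made-up range :)
let $status := xdmp:random(2) + 1   (: status, 2 or 3 values per the question :)
return
  xdmp:document-insert(
    fn:concat("/test/doc", $i, ".xml"),
    <doc>
      <collection cs="{$cs}" site="{$site}" status="{$status}"/>
    </doc>)
;

xquery version "1.0-ml";

(: Statement 2: a fully specified vector as an and-query of three
   attribute value queries on the same element. :)
let $q := cts:and-query((
  cts:element-attribute-value-query(xs:QName("collection"), xs:QName("cs"), "5"),
  cts:element-attribute-value-query(xs:QName("collection"), xs:QName("site"), "3"),
  cts:element-attribute-value-query(xs:QName("collection"), xs:QName("status"), "1")
))
return (
  (: index-only count of matching fragments :)
  xdmp:estimate(cts:search(fn:doc(), $q)),
  (: first few matches, resolved from the indexes without filtering :)
  cts:search(fn:doc(), $q, "unfiltered")[1 to 10]
)
```

Dropping one or two of the three value queries from the cts:and-query() gives the single- or dual-dimension lookups. Comparing the estimate with a filtered fn:count() of the same search on the sample data is one way to check whether the unfiltered results are accurate for your document structure.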
Lexicons, including a collection lexicon, URI lexicon, or range indexes, are always the fastest, but they do consume more memory. That may be a premature optimization, because your attribute approach should already be very fast. A cts:and-query() containing three cts:element-attribute-value-query() queries will then work.

I suggest you build a loop to add a representative sample of random, simple documents, and try it out. You won't need large test documents, because the index resolution phase is independent of document content.

Damon

________________________________________
From: [email protected] [[email protected]] On Behalf Of Mike Sokolov [[email protected]]
Sent: Wednesday, March 02, 2011 9:30 AM
To: Mark Logic
Subject: [MarkLogic Dev General] efficient storage/retrieval scheme

I need to design a data element for our platform with an eye to the most efficient possible retrieval of documents in a collection defined by this data element. Assume there could be millions of documents. It will have at least three dimensions: site, content-set, and status; these are all completely independent. None of these is likely to have more than a few tens or hundreds of different values: status will have 2 or 3, definitely fewer than 10.

I need to be able to retrieve documents based on the values of each dimension independently (i.e., all documents in content set X), as well as (and this could be more typical) a fully-specified vector (content-set, site, and status).

I can think of several possibilities:

1. An element whose text includes all three values as words in some predefined order:

<collection>cs100 site50 status1</collection>

with word queries for single-dimension queries and value (or maybe phrase?) queries for joins.

2. A ML collection whose name is all three values concatenated in some order:

collection("cs100-site50-status1")

Joins of all three dimensions become a simple collection lookup, with cts:collection-match() for single- or dual-dimension queries.

3.
An element with three attributes:

<collection cs="100" site="50" status="1" />

This is attractive from the perspective of XML modeling and will expose the values neatly for XPath (perhaps we could combine it with one of the above), but I'm concerned that cts:element-query(collection, ...) might not be as efficient for retrieval. Also, would we need to enable element-position indexes to make this accurate as an unfiltered query?

Would anyone care to comment on the "best" design? Other ideas?

Thanks!

--
Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston

PubFactory: the revolutionary e-publishing platform from iFactory

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
