Re: [MarkLogic Dev General] efficient storage/retrieval scheme

Damon Feldman Wed, 02 Mar 2011 07:20:51 -0800

Mike,

Items 2 and 3 are both great approaches, IMO. Item 1 seems less clean and I 
suggest keeping discrete system values in attributes instead.


Lexicons, including a collection lexicon, URI lexicon, or range indexes, are 
always the fastest, but do consume more memory. That may be a premature 
optimization because using your attribute approach should be very fast. A 
cts:and-query() of three cts:element-attribute-value-query() calls, one per 
attribute, will then work.
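
A minimal sketch of that query, assuming the <collection> element from 
option 3 is in no namespace and that each document carries exactly one such 
element (with several per document, the three attribute matches could come 
from different elements). The xdmp:estimate() calls are only there to show 
counts resolved from the indexes:

```xquery
xquery version "1.0-ml";

(: Assumption: documents contain <collection cs="100" site="50" status="1"/>
   in no namespace, as in option 3 of the original question. :)
declare variable $elem := xs:QName("collection");

(: Fully specified vector: all three attribute values must match. :)
let $all-three :=
  cts:and-query((
    cts:element-attribute-value-query($elem, xs:QName("cs"), "100"),
    cts:element-attribute-value-query($elem, xs:QName("site"), "50"),
    cts:element-attribute-value-query($elem, xs:QName("status"), "1")
  ))
(: Single dimension: everything in content set 100. :)
let $one-dimension :=
  cts:element-attribute-value-query($elem, xs:QName("cs"), "100")
return (
  xdmp:estimate(cts:search(fn:doc(), $all-three)),
  xdmp:estimate(cts:search(fn:doc(), $one-dimension))
)
```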

I suggest you build a loop to add a representative sample of random, simple 
documents, and try it out. You won't need large test documents, because the 
index resolution phase is independent of document content.
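
One way such a loop might look, as a sketch only. The URI prefix /test/, the 
<doc> wrapper, the document count, and the value ranges are all made up for 
illustration:

```xquery
xquery version "1.0-ml";

(: Insert small test documents with random values for each dimension.
   The count (10000) and the value ranges are arbitrary assumptions. :)
for $i in 1 to 10000
let $cs     := xdmp:random(99) + 1   (: content-set: about 100 distinct values :)
let $site   := xdmp:random(49) + 1   (: site: about 50 distinct values :)
let $status := xdmp:random(2) + 1    (: status: 3 distinct values :)
return
  xdmp:document-insert(
    fn:concat("/test/doc-", $i, ".xml"),
    <doc>
      <collection cs="{$cs}" site="{$site}" status="{$status}"/>
    </doc>
  )
```

For much larger samples it is probably worth splitting the inserts into 
batches rather than running them all in one transaction.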

Damon


________________________________________
From: [email protected] 
[[email protected]] On Behalf Of Mike Sokolov 
[[email protected]]
Sent: Wednesday, March 02, 2011 9:30 AM
To: Mark Logic
Subject: [MarkLogic Dev General] efficient storage/retrieval scheme

I need to design a data element for our platform with an eye to the most
efficient possible retrieval of documents in a collection defined by
this data element.  Assume there could be millions of documents.  It
will have at least three dimensions: site, content-set, and status;
these are all completely independent.  None of these are likely to have
more than a few tens or hundreds of different values: status will have 2
or 3, definitely less than 10.

I need to be able to retrieve documents based on the values of each
dimension independently (i.e., all documents in content set X), as well as
(and this could be more typical) a fully-specified vector (content-set,
site and status).

I can think of several possibilities:

1. An element whose text includes all three values as words in some
predefined order:

<collection>cs100 site50 status1</collection>

with word queries for single-dimension queries and value (or maybe
phrase?) queries for joins.

2. A ML collection whose name is all three values concatenated in some
order:

collection("cs100-site50-status1")

joins of all three dimensions become a simple collection lookup, and
cts:collection-match() for single- or dual-dimension queries.

3. An element with three attributes:
<collection cs="100" site="50" status="1" />
This is attractive from the perspective of XML modeling and will expose
the values neatly for xpath (perhaps we could combine it with one of the
above), but I'm concerned that:
cts:element-query(collection, ...) might not be as efficient for retrieval?
Also: would we need to enable element-position indexes to make this
accurate as an unfiltered query?
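
For comparison, option 2 might be sketched like this. It assumes the 
collection lexicon is enabled (cts:collection-match() needs it) and that 
collection names always follow the cs-site-status order with "-" as the 
delimiter, so the wildcard patterns cannot match across dimensions:

```xquery
xquery version "1.0-ml";

(: Fully specified vector: a single collection lookup. :)
let $exact := fn:collection("cs100-site50-status1")

(: Single dimension: expand the wildcard into matching collection names
   via the collection lexicon, then search across those collections. :)
let $site-names := cts:collection-match("*-site50-*")
let $by-site :=
  cts:search(fn:doc(), cts:collection-query($site-names))

return (fn:count($exact), fn:count($by-site))
```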

Would anyone care to comment on the "best" design?  Other ideas?

Thanks!

--
Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston

PubFactory: the revolutionary e-publishing platform from iFactory

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general