Mike, I see this as a trade-off between ingestion performance (tagging all the
possibilities at ingestion time) and query performance (lots of filtering at
query time). Personally I'd rather use the tagging approach, but let's see how
far we can go with the filtering approach.

The filtering approach has a couple of potential problems. First, if you have N
versions of a doc, the version check will filter out N-1 results. Since
xdmp:estimate and cts:remainder are resolved from the indexes, before any
filtering, that will greatly reduce their utility and will probably drive you
to count the actual results. That won't scale, so I would try to avoid that
approach.
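To make the estimate problem concrete, here is a rough, untested sketch that compares the index-resolved estimate with a filtered count. The word query is only a hypothetical stand-in for $random-query, and the version check is the one from your original message.

```xquery
xquery version "1.0-ml";

(: Hypothetical stand-in for $random-query. :)
let $random-query := cts:word-query('example')
(: Resolved from the indexes alone, so it counts every matching version. :)
let $estimated := xdmp:estimate(cts:search(doc(), $random-query))
(: The filtered count has to fetch and check every match. :)
let $actual := count(
  for $doc in cts:search(doc(), $random-query)
  where $doc/doc/@version
    = max(doc()/doc[@uri = $doc/doc/@uri]/@version)
  return $doc)
return ($estimated, $actual)
```

If every version of a doc matches the query, I would expect the first number to be roughly N times the second, which is why the estimate stops being useful for pagination totals.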

Second, the version check requires an extra XPath lookup for every match. True,
it will be a fast lookup - but it's still a round-trip to all the forests. If
you could turn it into a range index lookup, that might work better. Assuming an
attribute range index on doc/@version, something like this might work:

  where $doc/doc/@version
    = cts:element-attribute-values(
      xs:QName('doc'), xs:QName('version'),
      (), ('descending', 'limit=1'),
      cts:element-attribute-value-query(
        xs:QName('doc'), xs:QName('uri'), $doc/doc/@uri))

But that still means a lookup for every match, N-1 of which are wasted for a
doc with N versions. What if we cache each lookup as
it's made? The server won't do that automatically, but we could add a map.

  let $m := map:map()
  for $doc in cts:search(doc(), $random-query)
  let $uri as xs:string := string($doc/doc/@uri)
  let $cached := map:get($m, $uri)
  let $version :=
    if (exists($cached)) then $cached
    else
      (: map:put returns the empty sequence, so cache the value
         and then return it explicitly :)
      let $v := cts:element-attribute-values(
        xs:QName('doc'), xs:QName('version'),
        (), ('descending', 'limit=1'),
        cts:element-attribute-value-query(
          xs:QName('doc'), xs:QName('uri'), $uri))
      return (map:put($m, $uri, $v), $v)
  where $doc/doc/@version = $version
  return $doc

However, that still won't scale terribly well as the number of versions per
document increases, and the estimate problem remains. We haven't
begun to explore the entitlement problem, which you also mentioned as a
requirement, but that could probably fold into the max-version cts:query.
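As an untested sketch of what folding entitlements into the max-version lookup might look like: the function name is hypothetical, and I am assuming the entitlement rule can be expressed as a cts:query (a collection query here, purely for illustration).

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: the latest version of $uri that also matches
   $entitlement. Assumes an attribute range index on doc/@version. :)
declare function local:latest-visible-version(
  $uri as xs:string,
  $entitlement as cts:query)
as xs:anyAtomicType?
{
  cts:element-attribute-values(
    xs:QName('doc'), xs:QName('version'),
    (), ('descending', 'limit=1'),
    cts:and-query((
      cts:element-attribute-value-query(
        xs:QName('doc'), xs:QName('uri'), $uri),
      $entitlement)))
};

(: Assumption: entitlements are modelled as collections. :)
local:latest-visible-version('/docs/a', cts:collection-query('group-a'))
```

The cache key would then need to include the entitlement as well as the uri, since different users can see different latest versions.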

Another trick to employ is cts:uris(). This requires the URI lexicon, but it
would allow an approach to filtering that might be faster - if you can build
the versions into your URIs. I think that's a good idea anyway, and in this
example I treat the base URI as a directory name, and as the parent of the
version-aware URI. That has the nice property of letting you query for all
versions with a directory lookup.

  let $matching-uris :=
    let $m := map:map()
    for $uri in cts:uris((), ('document'), $random-query)
    (: e.g. a hypothetical URI '/docs/a/3' has version '3', base '/docs/a' :)
    let $version := tokenize($uri, '/')[last()]
    let $base := replace($uri, concat('/', $version, '$'), '')
    (: the cache is keyed on the base URI, so look it up by $base :)
    let $cached := map:get($m, $base)
    let $latest :=
      if (exists($cached)) then $cached
      else
        (: map:put returns the empty sequence, so return $v explicitly :)
        let $v := cts:element-attribute-values(
          xs:QName('doc'), xs:QName('version'),
          (), ('descending', 'limit=1'),
          cts:directory-query(concat($base, '/'), 'infinity'))
        return (map:put($m, $base, $v), $v)
    (: $version is a string taken from the URI, so compare as strings :)
    where $version = string($latest)
    return $uri
  let $page := cts:search(
    doc(),
    cts:and-query(
      ($random-query, cts:document-query($matching-uris))))[
    $start to $stop]
  return element page {
    attribute start { $start },
    attribute stop { $stop },
    attribute total { count($matching-uris) },
    $page
  }

That's untested code, and I can't promise that it will perform well enough to
avoid using pre-calculated tags. But working with a large number of strings is
easier than working with a large number of fragments, so I expect it to be
faster than the previous approach.

You might be able to replace the cts:search call with search:search and an
additional-query option, so that it handles highlighting and pagination for
you. You could wrap the cts:directory-query in a cts:and-query and insert your
entitlement query, too.
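A minimal, untested sketch of the search:search variant follows. The query text, the URIs, and the page window are hypothetical placeholders; in practice the URI list would be the $matching-uris sequence that survived the version check.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical inputs: the user's search text and the URIs that
   passed the latest-version check. :)
let $qtext := 'example'
let $matching-uris := ('/docs/a/3', '/docs/b/1')
return search:search(
  $qtext,
  <options xmlns="http://marklogic.com/appservices/search">
    <additional-query>{
      cts:document-query($matching-uris)
    }</additional-query>
  </options>,
  1,    (: start :)
  10)   (: page length :)
```

Note that search:search takes query text rather than a cts:query, so if $random-query is already a cts:query it could probably go inside additional-query too, wrapped in a cts:and-query with the document query.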

To give this approach an extra edge, you might consider boosting the quality
of each new version by something like $factor * $version, where $factor is 100
or 1000. That should tend to put the latest version of each document at the
beginning of the search results, which would help the paginated cts:search
portion of the code. The drawback is that the documents that have more versions
will also tend to appear first, which may not be desirable. You could boost the
quality of the latest version by a fixed amount, too, and zero the quality of
all previous versions. But that makes updates more expensive, and might not be
a win if your entitlement model ends up selecting versions at random.
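For the first variant, an untested sketch of the quality boost might look like this. The factor and the URI are hypothetical, and it assumes the document exists and that doc/@version is an integer.

```xquery
xquery version "1.0-ml";

(: Hypothetical values: the weighting factor and the URI of the
   version that was just loaded. :)
let $factor := 100
let $uri := '/docs/a/3'
(: Assumes doc/@version holds an integer. :)
let $version := xs:integer(doc($uri)/doc/@version)
return xdmp:document-set-quality($uri, $factor * $version)
```

The same value could instead be passed as the quality argument of xdmp:document-insert at load time, which would avoid a second update.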
-- Mike
On 21 Aug 2011, at 18:08 , Michael Sokolov wrote:
> I'm looking for advice about how best to solve a querying problem for
> one of our customers using MarkLogic. We need to be able to search
> documents that are stored in various versions, and we only want to
> search the newest version. What I am interested in here is whether it is
> possible to do this if the "newest version" property has to be computed,
> not stored. So something like:
>
> for $doc in cts:search(doc(), $random-query)
> where $doc/@version = max(doc()/doc[@uri=$doc/@uri]/@version)
> return $doc
>
> our documents store the document identifier that associates different
> versions as /doc/@uri
>
> Is that it? Maybe a range index on /doc/@uri would help there?
>
> The real kicker is that we also need to be able to apply an additional
> constraint in that some users may have access only to certain
> document-versions, so in those cases we need to search only the newest
> accessible version of each doc.
>
> Of course there can be lots of docs matching the $random-query, we want
> to be able to apply sorting criteria, and get the first page of results
> efficiently.
>
> Currently I am planning to generate tags at load-time that should make
> the querying efficient (basically I will mark, for every possible
> accessibility condition, the most current version - this is possible, if
> irritating, due to the structure of the access control rules), but this
> will introduce pain during document ingestion, and relies on
> restrictions of the kind of access control rules we can have. So I'm
> wondering if there is a passable query-time implementation I could use.
>
> -Mike
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general