Re: [MarkLogic Dev General] searching document versions

Michael Blakeley Thu, 25 Aug 2011 11:08:09 -0700

Mike, I see this as a trade-off between ingestion performance (tagging all the 
possibilities at ingestion time) and query performance (lots of filtering at 
query time). Personally I'd rather use the tagging approach, but let's see how 
far we can go with the filtering approach.

The filtering approach has a couple of potential problems. First, if you have N 
versions of a doc, the version check will filter out N-1 results. That will 
greatly reduce the utility of xdmp:estimate and cts:remainder, which will 
probably drive you to count the actual results. That won't scale, so I would 
try to avoid that approach.

Second, the version check requires an extra XPath lookup for every match. True, 
it will be a fast lookup - but it's still a round-trip to all the forests. If 
you could turn it into a range index lookup, that might work better. Something 
like this might work:

  where $doc/doc/@version
    = cts:element-attribute-values(
      xs:QName('doc'), xs:QName('version'),
      (), ('descending', 'limit=1'),
      cts:element-attribute-value-query(
        xs:QName('doc'), xs:QName('uri'), $doc/doc/@uri))

But that still means one lookup per match, and N-1 of those are redundant when 
N versions of a doc match. What if we cache each lookup as it's made? The 
server won't do that automatically, but we could add a map.

  let $m := map:map()
  for $doc in cts:search(doc(), $random-query)
  let $uri as xs:string := $doc/doc/@uri
  let $cached := map:get($m, $uri)
  let $version :=
    if (exists($cached)) then $cached
    else
      let $latest := cts:element-attribute-values(
        xs:QName('doc'), xs:QName('version'),
        (), ('descending', 'limit=1'),
        cts:element-attribute-value-query(
          xs:QName('doc'), xs:QName('uri'), $uri))
      (: map:put returns empty, so cache first and then return the value :)
      return (map:put($m, $uri, $latest), $latest)
  where $doc/doc/@version = $version
  return $doc

However, that still won't scale terribly well as the number of versions per 
document increases, and it does nothing for the estimate problem. We haven't 
begun to explore the entitlement problem, which you also mentioned as a 
requirement, but that could probably fold into the max-version cts:query.

Another trick to employ is cts:uris(). This requires the uri lexicon, but it 
would allow an approach to filtering that might be faster - if you can build 
the versions into your URIs. I think that's a good idea anyway, and in this 
example I treat the base URI as a directory name, and the parent of the 
version-aware URI. That has the nice property of letting you query for all 
versions with a directory lookup.

  let $matching-uris :=
    let $m := map:map()
    for $uri in cts:uris((), ('document'), $random-query)
    let $version := tokenize($uri, '/')[last()]
    let $base := replace($uri, concat('/', $version, '$'), '')
    (: cache by base URI, since every version of a doc shares it :)
    let $cached := map:get($m, $base)
    let $latest :=
      if (exists($cached)) then $cached
      else
        let $max := cts:element-attribute-values(
          xs:QName('doc'), xs:QName('version'),
          (), ('descending', 'limit=1'),
          cts:directory-query(concat($base, '/'), 'infinity'))
        (: map:put returns empty, so cache first and then return the value :)
        return (map:put($m, $base, $max), $max)
    (: compare as strings: the lexicon value has the range index type :)
    where $version = string($latest)
    return $uri
  let $page := cts:search(
    doc(),
    cts:and-query(
      ($random-query, cts:document-query($matching-uris))))[
    $start to $stop]
  return element page {
    attribute start { $start },
    attribute stop { $stop },
    attribute total { count($matching-uris) },
    $page
  }

That's untested code, and I can't promise that it will perform well enough to 
avoid using pre-calculated tags. But working with a large number of strings is 
easier than working with a large number of fragments, so I expect it to be 
faster than the previous approach.

You might be able to replace the cts:search call with search:search and an 
additional-query option, so that it handles highlighting and pagination for 
you. You could wrap the cts:directory-query with an and-query and insert your 
entitlement query, too.
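
As a rough, untested sketch of the search:search idea: search:search takes 
query text rather than a cts:query, so here I assume the original query and 
the URI restriction both ride along in additional-query, with the query text 
left empty. The values given to $random-query, $matching-uris, $start, and 
$page-length are placeholders for whatever the code above produces.

  xquery version "1.0-ml";
  import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

  (: hypothetical inputs; in practice these come from the code above :)
  declare variable $random-query as cts:query := cts:word-query('example');
  declare variable $matching-uris as xs:string* := ('/docs/a/3', '/docs/b/1');
  declare variable $start := xs:unsignedLong(1);
  declare variable $page-length := xs:unsignedLong(10);

  search:search(
    '',
    <options xmlns="http://marklogic.com/appservices/search">
      <additional-query>{
        (: restrict results to the latest-version URIs found earlier :)
        cts:and-query(($random-query, cts:document-query($matching-uris)))
      }</additional-query>
    </options>,
    $start, $page-length)

If your users type their queries as text, that text would replace the empty 
string, and additional-query would carry only the cts:document-query.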

To help this approach along, you might consider setting the quality of each 
new version to something like $factor * $version, where $factor is 100 or 
1000. That should tend to put the latest version of each document at the 
beginning of the search results, which would help the paginated cts:search 
portion of the code. The drawback is that the documents that have more versions 
will also tend to appear first, which may not be desirable. You could boost the 
quality of the latest version by a fixed amount, too, and zero the quality of 
all previous versions. But that makes updates more expensive, and might not be 
a win if your entitlement model ends up selecting versions at random.
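
An untested sketch of the first variant, run whenever a new version is 
loaded. The URI, version number, and factor are made-up values, and I'm 
assuming the version is an integer.

  xquery version "1.0-ml";

  (: hypothetical inputs: the URI and version number of the new version :)
  declare variable $uri as xs:string := '/docs/a/3';
  declare variable $version as xs:integer := 3;
  declare variable $factor as xs:integer := 100;

  (: higher versions get proportionally higher quality, so they tend to
     score higher in relevance order :)
  xdmp:document-set-quality($uri, $factor * $version)

The fixed-boost variant would make the same call with a constant for the new 
version, plus a second call that sets the previous version's quality to 0.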

-- Mike

On 21 Aug 2011, at 18:08 , Michael Sokolov wrote:

> I'm looking for advice about how best to solve a querying problem for 
> one of our customers using MarkLogic.  We need to be able to search 
> documents that are stored in various versions, and we only want to 
> search the newest version. What I am interested in here is whether it is 
> possible to do this if the "newest version" property has to be computed, 
> not stored.  So something like:
> 
> for $doc in cts:search(doc(), $random-query)
> where $doc/@version = max(doc()/doc[@uri=$doc/@uri]/@version)
> return $doc
> 
> our documents store the document identifier that associates different 
> versions as /doc/@uri
> 
> Is that it? Maybe a range index on /doc/@uri would help there?
> 
> The real kicker is that we also need to apply an additional 
> constraint in that some users may have access only to certain 
> document-versions, so in those cases we need to search only the newest 
> accessible version of each doc.
> 
> Of course there can be lots of docs matching the $random-query, we want 
> to be able to apply sorting criteria, and get the first page of results 
> efficiently.
> 
> Currently I am planning to generate tags at load-time that should make 
> the querying efficient (basically I will mark, for every possible 
> accessibility condition, the most current version - this is possible, if 
> irritating, due to the structure of the access control rules), but this 
> will introduce pain during document ingestion, and relies on 
> restrictions of the kind of access control rules we can have. So I'm 
> wondering if there is a passable query-time implementation I could use.
> 
> -Mike
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
> 

