Mike - you've given us a lot to chew on - thanks for that.  These are
all really useful suggestions, which, even if we don't end up using
them here (since I think we are all coming around to the load-time
tagging approach), I feel sure will come in handy at some point.

-Mike

PS: the blog comment fail is totally on my end.  My Firefox is acting up,
weirdly - some sites time out, but only in Firefox?  Anyway - off topic.

On 08/25/2011 02:08 PM, Michael Blakeley wrote:
> Mike, I see this as a trade-off between ingestion performance (tagging all 
> the possibilities at ingestion time) and query performance (lots of filtering 
> at query time). Personally I'd rather use the tagging approach, but let's see 
> how far we can go with the filtering approach.
>
> The filtering approach has a couple of potential problems. First, if you have 
> N versions of a doc, the version check will filter out N-1 results. That will 
> greatly reduce the utility of xdmp:estimate and cts:remainder, which will 
> probably drive you to count the actual results. That won't scale, so I would 
> try to avoid that approach.
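>
> As a quick illustration (a sketch only - xdmp:estimate resolves against the
> indexes, so it has no way to apply the version filter), the unfiltered
> estimate counts every version of every matching document:
>
>    (: counts all versions, so it overstates the true result count
>       by roughly a factor of N :)
>    xdmp:estimate(cts:search(doc(), $random-query))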
>
> Second, the version check requires an extra XPath lookup for every match. 
> True, it will be a fast lookup - but it's still a round-trip to all the 
> forests. If you could turn it into a range index lookup, that might work 
> better. Something like this might work:
>
>    where $doc/doc/@version
>      = cts:element-attribute-values(
>        xs:QName('doc'), xs:QName('version'),
>        (), ('descending', 'limit=1'),
>        cts:element-attribute-value-query(
>          xs:QName('doc'), xs:QName('uri'), $doc/doc/@uri))
>
> But that still means N-1 lookups per match. What if we cache each lookup as 
> it's made? The server won't do that automatically, but we could add a map.
>
>    let $m := map:map()
>    for $doc in cts:search(doc(), $random-query)
>    let $uri as xs:string := $doc/doc/@uri
>    let $version := map:get($m, $uri)
>    let $version :=
>      if (exists($version)) then $version
>      else
>        (: map:put returns the empty sequence, so cache the lookup
>           first and then yield it :)
>        let $latest := cts:element-attribute-values(
>          xs:QName('doc'), xs:QName('version'),
>          (), ('descending', 'limit=1'),
>          cts:element-attribute-value-query(
>            xs:QName('doc'), xs:QName('uri'), $uri))
>        return (map:put($m, $uri, $latest), $latest)
>    where $doc/doc/@version = $version
>    return $doc
>
> However, that still won't scale terribly well as the number of versions per 
> document increases, and the estimate problem is still a problem. We haven't 
> begun to explore the entitlement problem, which you also mentioned as a 
> requirement, but that could probably fold into the max-version cts:query.
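>
> For what it's worth, a sketch of that folding (untested, and assuming the
> entitlement can itself be expressed as a cts:query - here a hypothetical
> $entitlement-query, e.g. a collection-query or attribute-value query):
>
>    (: restrict the max-version lookup to versions the user may see :)
>    cts:element-attribute-values(
>      xs:QName('doc'), xs:QName('version'),
>      (), ('descending', 'limit=1'),
>      cts:and-query((
>        cts:element-attribute-value-query(
>          xs:QName('doc'), xs:QName('uri'), $uri),
>        $entitlement-query)))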
>
> Another trick to employ is cts:uris(). This requires the uri lexicon, but it 
> would allow an approach to filtering that might be faster - if you can build 
> the versions into your URIs. I think that's a good idea anyway, and in this 
> example I treat the base URI as a directory name, and the parent of the 
> version-aware URI. That has the nice property of letting you query for all 
> versions with a directory lookup.
>
>    let $matching-uris :=
>      let $m := map:map()
>      for $uri in cts:uris((), ('document'), $random-query)
>      let $version := tokenize($uri, '/')[last()]
>      let $base := replace($uri, concat('/', $version, '$'), '')
>      let $latest := map:get($m, $base)
>      let $latest :=
>        if (exists($latest)) then $latest
>        else
>          (: as above, cache the per-base lookup and then yield it -
>             note the map key is $base, not $uri :)
>          let $max := cts:element-attribute-values(
>            xs:QName('doc'), xs:QName('version'),
>            (), ('descending', 'limit=1'),
>            cts:directory-query(concat($base, '/'), 'infinity'))
>          return (map:put($m, $base, $max), $max)
>      where $version = $latest
>      return $uri
>    let $page := cts:search(
>      doc(),
>      cts:and-query(
>        ($random-query, cts:document-query($matching-uris))))[
>      $start to $stop]
>    return element page {
>      attribute start { $start },
>      attribute stop { $stop },
>      attribute total { count($matching-uris) },
>      $page
>    }
>
> That's untested code, and I can't promise that it will perform well enough to 
> avoid using pre-calculated tags. But working with a large number of strings 
> is easier than working with a large number of fragments, so I expect it to be 
> faster than the previous approach.
>
> You might be able to replace the cts:search call with search:search and an 
> additional-query option, so that it handles highlighting and pagination for 
> you. You could wrap the cts:directory-query with an and-query and insert your 
> entitlement query, too.
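>
> A sketch of that (untested; $query-text, $start and $page-length are
> assumed to come from the request, since search:search takes query text
> rather than a cts:query):
>
>    import module namespace search = "http://marklogic.com/appservices/search"
>      at "/MarkLogic/appservices/search/search.xqy";
>
>    let $options :=
>      <options xmlns="http://marklogic.com/appservices/search">
>        <additional-query>{
>          (: restrict hits to the latest-version URIs computed above;
>             an entitlement query could be and-ed in here as well :)
>          cts:document-query($matching-uris)
>        }</additional-query>
>      </options>
>    return search:search($query-text, $options, $start, $page-length)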
>
> To give this approach an extra boost, you might consider boosting the quality 
> of each new version by something like $factor * $version, where factor is 100 
> or 1000. That should tend to put the latest version of each document at the 
> beginning of the search results, which would help the paginated cts:search 
> portion of the code. The drawback is that the documents that have more 
> versions will also tend to appear first, which may not be desirable. You 
> could boost the quality of the latest version by a fixed amount, too, and 
> zero the quality of all previous versions. But that makes updates more 
> expensive, and might not be a win if your entitlement model ends up selecting 
> versions at random.
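>
> For example (a sketch only, assuming $uri, $doc and a numeric $version
> are in hand at ingest time):
>
>    (: document quality is the fifth argument to xdmp:document-insert :)
>    let $factor := 100
>    return xdmp:document-insert(
>      $uri, $doc,
>      xdmp:default-permissions(), xdmp:default-collections(),
>      $factor * $version)
>
> The fixed-boost variant would instead set a constant quality on the new
> document and call xdmp:document-set-quality with 0 on the superseded
> version's URI.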
>
> -- Mike
>
> On 21 Aug 2011, at 18:08, Michael Sokolov wrote:
>
>> I'm looking for advice about how best to solve a querying problem for
>> one of our customers using MarkLogic.  We need to be able to search
>> documents that are stored in various versions, and we only want to
>> search the newest version. What I am interested in here is whether it is
>> possible to do this if the "newest version" property has to be computed,
>> not stored.  So something like:
>>
>> for $doc in cts:search(doc(), $random-query)
>> where $doc/doc/@version = max(doc()/doc[@uri = $doc/doc/@uri]/@version)
>> return $doc
>>
>> Our documents store the document identifier that associates the
>> different versions of a document as /doc/@uri.
>>
>> Is that it? Maybe a range index on /doc/@uri would help there?
>>
>> The real kicker is that we also need to apply an additional
>> constraint: some users may have access only to certain
>> document-versions, so in those cases we need to search only the newest
>> accessible version of each doc.
>>
>> Of course there can be lots of docs matching $random-query; we want
>> to be able to apply sorting criteria and get the first page of results
>> efficiently.
>>
>> Currently I am planning to generate tags at load time that should make
>> the querying efficient (basically I will mark, for every possible
>> accessibility condition, the most current version - this is possible, if
>> irritating, due to the structure of the access control rules), but this
>> will introduce pain during document ingestion, and it relies on
>> restricting the kinds of access control rules we can have. So I'm
>> wondering if there is a passable query-time implementation I could use.
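>>
>> (For illustration, a minimal sketch of that kind of load-time tagging,
>> simplified to a single 'latest' collection rather than per-entitlement
>> tags; $doc-id, $new-uri and $new-doc are assumed to be in hand at load
>> time:)
>>
>> let $previous-latest := cts:search(doc(),
>>   cts:and-query((
>>     cts:element-attribute-value-query(
>>       xs:QName('doc'), xs:QName('uri'), $doc-id),
>>     cts:collection-query('latest'))))
>> return (
>>   xdmp:document-insert($new-uri, $new-doc,
>>     xdmp:default-permissions(), 'latest'),
>>   for $old in $previous-latest
>>   return xdmp:document-remove-collections(xdmp:node-uri($old), 'latest')
>> )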
>>
>> -Mike
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
