Re: [MarkLogic Dev General] Fast (indexed) query for getting unique absolute Xpaths from each file

Michael Blakeley Fri, 26 Jul 2013 09:51:08 -0700

There is no built-in way to build a lexicon of XPaths. You can index the values 
of nodes at specific paths, but that doesn't build an index of the paths themselves.

One option is to sample the data, using http://docs.marklogic.com/cts:search 
with the score-random option. Also xdmp:path is probably faster than what 
you're doing. So start with something like this, setting $LIMIT to the largest 
value that completes comfortably:

    subsequence(
      cts:search(doc(), cts:collection-query('0NF9'), 'score-random'),
      1, $LIMIT)
      /descendant::*/xdmp:path(.)

If you want more data, run multiple samples and merge the results. Map 
operations could be useful for that.
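
A rough sketch of merging several samples through a map, so that duplicate 
paths collapse onto a single key. $SAMPLES and the value of $LIMIT are 
placeholders I made up, and I'm assuming each pass draws a fresh random sample:

    xquery version "1.0-ml";
    (: Illustrative values only; tune both for your data. :)
    declare variable $LIMIT := 1000;
    declare variable $SAMPLES := 5;

    let $seen := map:map()
    let $_ :=
      for $i in 1 to $SAMPLES
      for $p in subsequence(
          cts:search(doc(), cts:collection-query('0NF9'), 'score-random'),
          1, $LIMIT)
        /descendant::*/xdmp:path(.)
      (: map keys are unique, so a repeated path is stored only once :)
      return map:put($seen, $p, true())
    return map:keys($seen)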

You could probably speed up the way you're looking up existing XPaths too. The 
obvious optimization is to haul the fn:doc('/example.xml')//xpath expression to 
the top of the FLWOR, but I'd suggest looking into maps too. You could probably 
refactor to call xdmp:node-insert-child just once, as well. And many paths are 
identical aside from position, so you might add a call to replace '\[\d+\]' 
with '' and then wrap the whole thing with distinct-values.
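
Something along these lines, as an untested sketch. The $LIMIT value is 
illustrative, and I've used xdmp:node-replace on the report element as one way 
to get down to a single update call; that is my substitution, not the only way 
to do it:

    xquery version "1.0-ml";
    (: Illustrative value only. :)
    declare variable $LIMIT := 1000;

    (: Read the existing report once and index its paths in a map. :)
    let $report := doc('/example.xml')/report
    let $known := map:map()
    let $_ :=
      for $x in $report/xpath
      return map:put($known, string($x), true())
    (: Strip positional predicates, then de-duplicate. :)
    let $paths := distinct-values(
      subsequence(collection('0NF9'), 1, $LIMIT)
        /descendant::*/replace(xdmp:path(.), '\[\d+\]', ''))
    let $today := format-date(current-date(), '[Y0001]-[M01]-[D01]')
    let $new :=
      for $p in $paths
      where empty(map:get($known, $p))
      return <xpath date="{$today}" approved="no">{$p}</xpath>
    return
      if (empty($new)) then ()
      (: one update: rewrite the report with old and new entries :)
      else xdmp:node-replace($report,
        <report>{$report/@*, $report/node(), $new}</report>)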

Another enhancement might be to use xdmp:spawn, with one task per document. 
That doesn't make the work any faster per se, but it permits parallelism so the 
wall-clock time could be shorter.
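
A minimal sketch of the spawning side. It assumes the URI lexicon is enabled, 
and the module path /tasks/collect-paths.xqy and its external variable $URI are 
names I invented for illustration:

    xquery version "1.0-ml";
    (: Assumes the URI lexicon is enabled, and that a module exists at the
       made-up path /tasks/collect-paths.xqy which declares
       "declare variable $URI as xs:string external;" :)
    for $uri in cts:uris((), (), cts:collection-query('0NF9'))
    return xdmp:spawn('/tasks/collect-paths.xqy', (xs:QName('URI'), $uri))

If every task updates the same /example.xml they will queue up on its write 
lock, so it is probably better for each task to write its results somewhere of 
its own and to merge them afterwards.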

If sampling isn't good enough, consider turning all this on its head. Instead of doing 
the work in a batch, do the work as each document is inserted or updated. This 
would use a CPF pipeline, or a standalone trigger. It would permit parallelism 
because each document updates independently. The CPF action or trigger would 
gather all the document's paths into 'xpath' elements, and add them to the body 
or properties of the document. Given a string range index on 'xpath', you have 
your path lexicon - and with frequency data too.
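
As a sketch of the trigger flavor, assuming a document create/modify trigger is 
already configured to call this module. The 'xpaths' wrapper element is my own 
invention: it lets a single xdmp:document-set-property call replace the 
previous list on each update:

    xquery version "1.0-ml";
    import module namespace trgr = "http://marklogic.com/xdmp/triggers"
      at "/MarkLogic/triggers.xqy";

    (: Supplied by the trigger framework. :)
    declare variable $trgr:uri as xs:string external;

    (: Gather the document's distinct paths, minus positional predicates. :)
    let $paths := distinct-values(
      doc($trgr:uri)/descendant::*/replace(xdmp:path(.), '\[\d+\]', ''))
    (: Store them in the document's properties under one wrapper element. :)
    return xdmp:document-set-property($trgr:uri,
      <xpaths>{ for $p in $paths return <xpath>{$p}</xpath> }</xpaths>)

Reading the lexicon back would then look roughly like this, assuming a string 
element range index on 'xpath' (no namespace) with the default collation:

    xquery version "1.0-ml";
    (: Each value comes back with the number of fragments containing it. :)
    for $p in cts:element-values(xs:QName('xpath'))
    return concat(cts:frequency($p), ' ', $p)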

-- Mike

On 26 Jul 2013, at 02:52 , "Vaitkus, Arunas (LNG-LON)" 
<[email protected]> wrote:

> Dear Marklogic Users,
>  
> I have a real problem and got stuck at a dead-end (or so I think at the 
> moment!). I have a number of XML files uploaded to MarkLogic server and need 
> to run a report listing all possible absolute XPATHs. What this means is that 
> I need to parse each XML document, find each node’s unique absolute XPATH 
> (for example “/full/xpath/namespace:to/test:specific/node” as I don’t need to 
> know about their position using predicates) and then insert a report to 
> another XML file.
>  
> It is fine with a small set of content (less than 2000 files) as I hit 
> expanded tree cache. However, in my case there are tens of thousands of files 
> that make everything very slow.
>  
> What I am looking for is if there is any indexing capabilities that could 
> help me out in this case (getting these XPATHs)?
>  
> Currently, my Xquery is very simple and does not reuse any of indexing (as I 
> am not sure what I could change to get indexing gains here):
>  
> -----------------------
> xquery version "1.0-ml";
> declare namespace html = "http://www.w3.org/1999/xhtml";
>  
> <report>
> {
>  for $collection in subsequence(collection('0NF9'), 1, 20000)  
>      for $document in $collection  
>         for $node in $document//*   
>            let $full-xpath := $node/string-join(ancestor-or-self::*/name(), 
> '/')  (: this is building textual full XPath representation :)
>            let $row := fn:doc('/example.xml')//xpath  
>               return if (not($row=$full-xpath)) then 
> (xdmp:node-insert-child(doc("/example.xml")/report, (<xpath 
> date="{format-date(fn:current-date(), "[Y0001]-[M01]-[D01]")}" 
> approved="no">{$full-xpath}</xpath>))) else ()
>  
> }
> </report>
> --------------
>  
> Many Thanks,
> Arunas
> 
> LexisNexis is a trading name of REED ELSEVIER (UK) LIMITED - Registered 
> office - 1-3 STRAND, LONDON WC2N 5JR
> Registered in England - Company No. 02746621 
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
