There is no built-in way to build a lexicon of XPaths. You can index the values of nodes at specific paths, but that doesn't build an index of the paths themselves.
One option is to sample the data, using http://docs.marklogic.com/cts:search with the score-random option. Also, xdmp:path is probably faster than what you're doing. So start with something like this, setting $LIMIT to the largest value that completes comfortably:

subsequence(
  cts:search(doc(), cts:collection-query('0NF9'), 'score-random'),
  1, $LIMIT)
/descendant::*/xdmp:path(.)

If you want more data, run multiple samples and merge the results. Map operations could be useful for that.

You could probably speed up the way you're looking up existing XPaths too. The obvious optimization is to haul the fn:doc('/example.xml')//xpath expression to the top of the FLWOR, so it is evaluated once rather than once per node, but I'd suggest looking into maps too. You could probably refactor to call xdmp:node-insert-child just once, as well. And many paths are identical aside from position, so you might add a call to replace '\[\d+\]' with '' and then wrap the whole thing with distinct-values.

Another enhancement might be to use xdmp:spawn, with one task per document. That doesn't make the work any faster per se, but it permits parallelism, so the wall-clock time could be shorter.

If sampling isn't good enough, consider turning all this on its head. Instead of doing the work in a batch, do the work as each document is inserted or updated. This would use a CPF pipeline, or a standalone trigger. It would permit parallelism because each document updates independently. The CPF action or trigger would gather all the document's paths into 'xpath' elements, and add them to the body or properties of the document. Given a string range index on 'xpath', you have your path lexicon - and with frequency data too.

-- Mike

On 26 Jul 2013, at 02:52 , "Vaitkus, Arunas (LNG-LON)" <[email protected]> wrote:

> Dear Marklogic Users,
>
> I have a real problem and got stuck at a dead-end (or so I think at the
> moment!). I have a number of XML files uploaded to MarkLogic server and need
> to run a report listing all possible absolute XPATHs. What this means is that
> I need to parse each XML document, find each node’s unique absolute XPATH
> (for example “/full/xpath/namespace:to/test:specific/node” as I don’t need to
> know about their position using predicates) and then insert a report to
> another XML file.
>
> It is fine with a small set of content (less than 2000 files) as I hit
> expanded tree cache. However, in my case there are tens of thousands of files
> that make everything very slow.
>
> What I am looking for is if there is any indexing capabilities that could
> help me out in this case (getting these XPATHs)?
>
> Currently, my Xquery is very simple and does not reuse any of indexing (as I
> am not sure what I could change to get indexing gains here):
>
> -----------------------
> xquery version "1.0-ml";
> declare namespace html = "http://www.w3.org/1999/xhtml";
>
> <report>
> {
> for $collection in subsequence(collection('0NF9'), 1, 20000)
> for $document in $collection
> for $node in $document//*
> let $full-xpath := $node/string-join(ancestor-or-self::*/name(), '/')
>   (: this is building textual full XPath representation :)
> let $row := fn:doc('/example.xml')//xpath
> return if (not($row=$full-xpath)) then
>   (xdmp:node-insert-child(doc("/example.xml")/report,
>     (<xpath date="{format-date(fn:current-date(), "[Y0001]-[M01]-[D01]")}"
>       approved="no">{$full-xpath}</xpath>)))
>   else ()
>
> }
> </report>
> --------------
>
> Many Thanks,
> Arunas
>
> LexisNexis is a trading name of REED ELSEVIER (UK) LIMITED - Registered
> office - 1-3 STRAND, LONDON WC2N 5JR
> Registered in England - Company No. 02746621
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
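
For readers following the thread, the sampling suggestion in the reply (random sample, xdmp:path, stripping positional predicates, merging into a map) might be combined roughly as follows. This is only a sketch under stated assumptions, not the exact code from the reply: the values of $LIMIT and $SAMPLES are hypothetical placeholders, and it assumes that repeated score-random searches within one request return different samples. If they do not, the same map-based merge could be applied to the results of separate requests.

```xquery
xquery version "1.0-ml";

(: Hypothetical settings: tune both for your own data and cache sizes. :)
declare variable $LIMIT := 1000;
declare variable $SAMPLES := 3;

(: The map keys act as the set of distinct paths seen so far. :)
let $paths := map:map()
let $_ :=
  for $i in 1 to $SAMPLES
  for $doc in subsequence(
    cts:search(doc(), cts:collection-query('0NF9'), 'score-random'),
    1, $LIMIT)
  for $p in $doc/descendant::*/xdmp:path(.)
  (: Drop positional predicates such as [2], so that paths differing
     only by position collapse into a single key. :)
  let $key := replace($p, '\[\d+\]', '')
  return map:put($paths, $key, true())
for $key in map:keys($paths)
order by $key
return $key
```

Using a map rather than distinct-values keeps the merge incremental: each additional sample only adds keys that have not been seen yet.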
