The document URI is normally stored in the disk block with the document data and properties, so getting the URI does require loading the document into memory ... provided you are referencing it through a document node.
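For reference, a minimal sketch of the three accessors discussed in this thread, applied to a document node. The URI "/tweets/1.xml" is hypothetical, and the sketch assumes a document exists at that URI; substitute one from your own database.

```xquery
xquery version "1.0-ml";

(: "/tweets/1.xml" is a hypothetical URI used only for illustration. :)
(: fn:doc() loads the document, so each accessor below reads the URI :)
(: from a document node that is already in memory.                   :)
let $doc := fn:doc("/tweets/1.xml")
return (
  fn:document-uri($doc),
  fn:base-uri($doc),
  xdmp:node-uri($doc)
)
```

For a plain document node with no xml:base in play, all three should report the same URI.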
If the document is pulled into memory for the sole purpose of getting its URI, it can be slow.
To test this I have a DB with 1.6 million tweets ...
Even after trying it once, these calls are slow on my system:

    count( doc()/fn:base-uri() )       1 min 25 sec
    count( doc()/fn:document-uri() )   1 min 26 sec
    count( doc()/xdmp:node-uri(.) )    1 min 22 sec

But if all you want are URIs, consider the URI lexicon. This lexicon is stored separately from the documents, and all together, so iterating through all the URIs is much faster.
Even without using the advanced filtering functions this can be fast:

    count( cts:uris() )                0.36 seconds

If you are dealing with billions of docs instead of a million, then you should definitely use the advanced options for this call to retrieve only the URIs that you want.

If the document is already in memory, fetching its URI is fast (and I don't know another way than using one of the above xxx-uri() methods).

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com

From: [email protected] On Behalf Of anoop raj p
Sent: Tuesday, October 22, 2013 6:15 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] derferencing documents with document-uri and base-uri?

Please remove me from email list.

On Tue, Oct 22, 2013 at 3:44 PM, Rachel Wilson <[email protected]> wrote:

I didn't think it was a problem as such. I wasn't trying to prematurely optimise, I promise, but I was curious about the workings under the hood since we use these functions a lot, including in our slower running queries - investigating those is how this question came up. Think about this as settling a bet ;)

So, I'm still curious - what is dereferencing? Is that indeed what happens?
Say we have a database node returned from a query, which isn't the document node, and we call base-uri on it. Would the whole document itself necessarily have been put in the expanded tree cache in order to resolve the query? I'm still learning about the roles of the different caches and it's turning out to be very helpful to know.

PS. We don't have subfragments.

-----Original Message-----
From: Michael Blakeley <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Monday, 21 October 2013 18:39
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] derferencing documents with document-uri and base-uri?

I wouldn't worry about it unless it's clearly a problem: avoid premature optimization.

If you have a database node in memory, then it's in the expanded tree cache. So repeated accessor calls for its URI can drive cache lookups and CPU cycles, but should never result in cache misses. Check the xdmp:query-meters output to see this for yourself: you should be able to correlate the number of URI accesses to the expanded-tree-cache-hit count.

Things might get a little more expensive if you have subfragments, because crossing fragment boundaries can be expensive. A call to base-uri inside a subfragment might have to traverse to the parent fragment - or maybe not, I'd have to design a test to say for certain.

But the time to worry is when you have a performance problem, and your test case shows the URI accessor in the profiler output. Then you could think about ways to minimize URI lookups.

Switching to functionality, I almost always use xdmp:node-uri rather than document-uri or base-uri. I avoid document-uri simply because I don't want to worry about traversing to root for document-uri, and base-uri because I don't want the behavior where an ancestor element specifies its own base-uri value.
That's rare in most XML, but base-uri checks for it and honors it. Checking for that probably slows things down a bit, and honoring it generally doesn't do what I want. So I always use xdmp:node-uri instead.

-- Mike

On 21 Oct 2013, at 09:54, Rachel Wilson <[email protected]> wrote:

> I have heard on the grapevine that to use document-uri() or base-uri()
> functions is bad for performance, although I can't seem to find anything
> about that in MarkLogic's docs or elsewhere on the internet. One of the
> reasons given was that using those functions "dereference the document",
> or that MarkLogic Server has to go to disk to resolve the uri. Although
> I'm not sure what is really meant by "dereference"
>
> Could someone clear this up. Has the grapevine got the wrong end of the
> stick or is it perhaps how the function is used, perhaps in loops, that
> is the reason behind this thinking? We use those two functions so much,
> particularly base-uri(), in our code that we would consider some rewrites
> if it really is something to minimise.
>
> Many thanks,
> Rachel
>
> ----------------------------
>
> http://www.bbc.co.uk
> This e-mail (and any attachments) is confidential and may contain
> personal views which are not the views of the BBC unless specifically
> stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in
> reliance on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.

--
anoop raj p

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
