Good one, much smarter this way. Better to save asynchronous processing for slow queries that cannot be optimized.
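
For what it's worth, an untested sketch of the range index check Kelly describes might look roughly like this. It assumes a string element range index with codepoint collation already exists on <binary-resource> (the element name from Ivan's message; Kelly's reply calls it <binary-node/>), and the $uris value is only a hypothetical stand-in for the real list of 2,000,000 entries.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample input; the real list would be loaded from elsewhere. :)
declare variable $uris as xs:string* := (
  "content/repository001/data/store001/location001/file.dat"
);

(: Assumes a string range index with codepoint collation
   on <binary-resource>; the query fails without that index. :)
<report>{
  for $uri in $uris
  let $query :=
    cts:element-range-query(
      xs:QName("binary-resource"), "=", $uri,
      "collation=http://marklogic.com/collation/codepoint")
  (: xdmp:estimate answers from the indexes alone,
     so no document has to be read from disk. :)
  where xdmp:estimate(cts:search(fn:doc(), $query)) eq 0
  return <result uri="{$uri}">WARNING: Resource not bound</result>
}</report>
```

Only the unmatched URIs end up in the report, which should keep the output small if most resources are bound.
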
Grtz,

Drs. G.P.H. Josten
Consultant

Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
http://www.daidalos.nl/
KvK 27164984

> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Kelly Stirman
> Sent: Tuesday, 18 August 2009 15:33
> To: [email protected]
> Subject: [MarkLogic Dev General] RE: Processing huge sequences
>
> Ivan,
>
> If <binary-node/> has the same value as the URI you are
> checking against, then I think you can do the following:
>
> 1) create an element range index of type string with
>    collation equal to codepoint collation on <binary-node/>
> 2) iterate over each of the values in your list and check
>    whether it exists in any of your <binary-node/> values using
>    cts:element-range-query()
>    (http://developer.marklogic.com/pubs/4.1/apidocs/cts-query.html#cts:element-range-query),
>    and only return values that do not match as an XML report.
>
> If you need to consider the database URI and the value of
> <binary-node/>, then I suggest combining the two into a new
> element or attribute that you can use for a range index to
> follow the same approach.
>
> Range indexes are memory-mapped and much faster than
> retrieving full documents from disk. Even at 10ms/doc, 2M
> queries is going to take a long time to follow your approach
> of looking at each document. I think the range index approach
> will be at least an order of magnitude faster.
>
> Others may have elaborations on this approach. For example,
> you could spawn each URI in your list to check the range
> index, and write to the doc properties if it doesn't match,
> per Geert's recommendation.
>
> Kelly
>
> Geert,
>
> The task is to go through a list of string values and perform
> a simple operation for each of them. More precisely: I have
> about 2,000,000 URIs which I received as a plain text
> document and then turned into XML by means of Perl. Each of
> them has the following structure:
>
> content/repository001/data/store001/location001/file.dat
>
> and represents a path to a binary resource which is located
> in some remote data repository (nothing to do with MarkLogic).
>
> At the same time, /data/store001/location001/ is a directory
> on my MarkLogic server where a resource.xml file can be found.
> In that file there is a node <binary-resource> which must
> contain the binary resource URI, so its value is similar to what
> was described above:
>
> content/repository001/data/store001/location001/file.dat
>
> What I need is to go over all of the 2,000,000 URIs in my list
> and check whether any of them are not referenced in the
> appropriate XML instances on MarkLogic, i.e. analyze.xqy does
> the following:
>
> define variable $uri as xs:string external
> (: $uri =
>    "content/repository001/data/store001/location001/file.dat" :)
>
> let $path :=
>   fn:concat(
>     "/",
>     fn:string-join(
>       fn:tokenize($uri, "/")[3 to fn:last()-1],
>       "/"
>     ),
>     "/"
>   )
> (: $path = "/data/store001/location001/" :)
>
> return
>   if (xdmp:directory($path, "1")//binary-resource[1] = $uri) then (: Checking reference :)
>     <result path="{$path}">Check OK</result>
>   else
>     <result path="{$path}">WARNING: Resource not bound</result>
>
> Apologies for the long message, I just wanted to make things clear.
>
> Thanks,
> _Van
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
