[MarkLogic Dev General] RE: Processing huge sequences

Geert Josten Tue, 18 Aug 2009 06:46:40 -0700

Good one, much smarter this way. Better to save asynchronous processing for 
slow queries that cannot be optimized.


Grtz


Drs. G.P.H. Josten
Consultant


http://www.daidalos.nl/
Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
KvK 27164984
The information sent in or with this email message originates from 
Daidalos BV and is intended solely for the addressee. If you have received 
this message unintentionally, we ask that you delete it. No rights can be 
derived from this message.


> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Kelly Stirman
> Sent: Tuesday, 18 August 2009 15:33
> To: [email protected]
> Subject: [MarkLogic Dev General] RE: Processing huge sequences
>
> Ivan,
>
> If <binary-node/> has the same value as the URI you are
> checking against, then I think you can do the following:
>
> 1) create an element range index of type string with
> collation equal to codepoint collation on <binary-node/>
> 2) iterate over each of the values in your list and check
> whether it exists in any of your <binary-node/> values using
> cts:element-range-query()
> (http://developer.marklogic.com/pubs/4.1/apidocs/cts-query.html#cts:element-range-query),
> and only return values that do not match as an XML report.
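
A minimal sketch of the two steps above, not a tested implementation. It assumes the range index from step 1 already exists on <binary-node/>, and that the list of URIs has been loaded as a document at the hypothetical URI /uri-list.xml with the hypothetical structure <uris><uri>...</uri></uris>. The <report> and <missing> element names are also made up for illustration. Using xdmp:estimate to count matches from the indexes is one possible way to run the check; the original message does not prescribe it.

```xquery
xquery version "1.0-ml";

(: Hypothetical input: one <uri> element per URI to check.
   The document URI and element names are assumptions. :)
declare variable $uris as xs:string* :=
  fn:doc("/uri-list.xml")/uris/uri/fn:string(.);

<report>{
  for $uri in $uris
  (: Step 2: look the value up in the range index. This requires the
     string range index on binary-node with the codepoint collation
     described in step 1. :)
  let $query :=
    cts:element-range-query(
      xs:QName("binary-node"), "=", $uri,
      "collation=http://marklogic.com/collation/codepoint")
  (: xdmp:estimate counts matches from the indexes
     without fetching the matching documents. :)
  where xdmp:estimate(cts:search(fn:doc(), $query)) eq 0
  return <missing>{$uri}</missing>
}</report>
```

Only the URIs with no matching <binary-node/> value end up in the report, so the output stays small even when the input list is large.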
>
> If you need to consider the database URI and the value of
> <binary-node/>, then I suggest combining the two into a new
> element or attribute that you can use for a range index to
> follow the same approach.
>
> Range indexes are memory-mapped and much faster than
> retrieving full documents from disk. Even at 10ms/doc, 2M
> lookups that each fetch a document will take about 20,000
> seconds (roughly 5.5 hours). I think the range index approach
> will be at least an order of magnitude faster.
>
> Others may have elaborations on this approach. For example,
> you could spawn each URI in your list to check the range
> index, and write to the doc properties if it doesn't match,
> per Geert's recommendation.
>
> Kelly
>
> Geert,
>
> The task is to go through a list of string values and perform
> a simple operation for each of them. More precisely: I have
> about 2,000,000 URIs which I received as a plain text
> document and then turned into XML by means of Perl. Each of
> them has the following structure:
>
> content/repository001/data/store001/location001/file.dat
>
> and represents a path to a binary resource which is located
> in some remote data repository (nothing to do with MarkLogic).
>
> At the same time, /data/store001/location001/ is a directory
> on my MarkLogic server where a resource.xml file can be found.
> In that file there is a node <binary-resource> which must
> contain the binary resource URI, so its value is similar to what
> was described above:
>
> content/repository001/data/store001/location001/file.dat
>
> What I need is to go over all 2,000,000 URIs in my list
> and check whether any of them are not referenced in the
> appropriate XML instances on MarkLogic, i.e. analyze.xqy does
> the following:
>
> define variable $uri as xs:string external
> (: $uri =
> "content/repository001/data/store001/location001/file.dat" :)
>
> let $path :=
>       fn:concat(
>               "/",
>               fn:string-join(
>                       fn:tokenize($uri, "/")[3 to fn:last()-1],
>                       "/"
>               ),
>               "/"
>       )
> (: $path = "/data/store001/location001/" :)
>
> return
>       (: Checking reference :)
>       if (xdmp:directory($path, "1")//binary-resource[1] = $uri) then
>               <result path="{$path}">Check OK</result>
>       else
>               <result path="{$path}">WARNING: Resource not bound</result>
>
> Apologies for the long message, I just wanted to make things clear.
>
> Thanks,
> _Van
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
>
