Thanks Kelly, thanks All,

Your advice is very helpful! I'll definitely try it.

Van

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Kelly Stirman
Sent: Tuesday, August 18, 2009 5:33 PM
To: [email protected]
Subject: [MarkLogic Dev General] RE: Processing huge sequences

Ivan,

If <binary-node/> has the same value as the URI you are checking against, then 
I think you can do the following:

1) create an element range index of type string with collation equal to 
codepoint collation on <binary-node/>
2) iterate over each of the values in your list and check whether it exists in 
any of your <binary-node/> values using cts:element-range-query() 
(http://developer.marklogic.com/pubs/4.1/apidocs/cts-query.html#cts:element-range-query),
 and only return values that do not match as an XML report.
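
A minimal sketch of step 2 for a single URI, assuming the range index from 
step 1 already exists. The element is called binary-node here, as in this 
message (the original question calls it binary-resource, so substitute 
whichever name your documents use). The <unreferenced/> wrapper is a 
hypothetical report element, and the sketch is written in the 1.0-ml dialect 
rather than the older "define variable" syntax:

```xquery
xquery version "1.0-ml";

(: One URI from the list, passed in by the caller. :)
declare variable $uri as xs:string external;

(: Assumes a string range index with codepoint collation on <binary-node/>. :)
let $query :=
  cts:element-range-query(
    xs:QName("binary-node"), "=", $uri,
    "collation=http://marklogic.com/collation/codepoint")
return
  (: xdmp:estimate answers from the indexes without fetching documents. :)
  if (xdmp:estimate(cts:search(fn:doc(), $query)) gt 0)
  then ()
  else <unreferenced>{$uri}</unreferenced>
```

Only URIs with no matching element value produce output, so collecting the 
results over the whole list gives the report of unbound resources.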

If you need to consider the database URI and the value of <binary-node/>, then 
I suggest combining the two into a new element or attribute that you can use 
for a range index to follow the same approach.

Range indexes are memory-mapped and much faster than retrieving full documents 
from disk. Even at 10ms/doc, 2M lookups done one document at a time add up to 
about 20,000 seconds (roughly 5.5 hours) if run serially, so your current 
approach is going to take a long time. I think the range index approach will 
be at least an order of magnitude faster.

Others may have elaborations on this approach. For example, you could spawn 
each URI in your list to check the range index, and write to the doc properties 
if it doesn't match, per Geert's recommendation. 
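
A rough sketch of the spawning idea, under stated assumptions: the URI list is 
taken to be loaded as /uri-list.xml with one <uri> element per entry, and 
/check-uri.xqy is a hypothetical module that declares an external $uri 
variable and runs the range-index check. Neither name comes from this thread. 
How the spawned module records a mismatch (for example in document 
properties) is left out:

```xquery
xquery version "1.0-ml";

(: Hypothetical: the list was loaded as /uri-list.xml, one <uri> per entry. :)
for $uri in fn:doc("/uri-list.xml")//uri
return
  (: Queue one Task Server job per URI. /check-uri.xqy is a hypothetical
     module that declares an external variable named "uri". :)
  xdmp:spawn("/check-uri.xqy", (xs:QName("uri"), fn:string($uri)))
```

The Task Server queue has a finite size, so a list of this length would 
probably have to be spawned in batches rather than in one pass.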

Kelly

Geert,

The task is to go through a list of string values and perform a simple 
operation for each of them. More precisely: I have about 2,000,000 URIs which 
I received as a plain text document and then turned into XML by means of Perl. 
Each of them has the following structure:

content/repository001/data/store001/location001/file.dat

and represents a path to a binary resource which is located in some remote data 
repository (nothing to do with MarkLogic).

At the same time, /data/store001/location001/ is a directory on my MarkLogic 
server where a resource.xml file can be found. In that file there is a 
<binary-resource> node which must contain the binary resource URI, so its 
value is similar to what was described above:

content/repository001/data/store001/location001/file.dat

What I need is to go over all 2,000,000 URIs in my list and check whether any 
of them is not referenced in the corresponding XML document on MarkLogic, i.e. 
analyze.xqy does the following:

define variable $uri as xs:string external
(: $uri = "content/repository001/data/store001/location001/file.dat" :)

let $path :=
        fn:concat(
                "/",
                fn:string-join(
                        fn:tokenize($uri, "/")[3 to fn:last()-1],
                        "/"
                ),
                "/"
        )
(: $path = "/data/store001/location001/" :)

return
        if (xdmp:directory($path, "1")//binary-resource[1] = $uri) then
                (: Checking reference :)
                <result path="{$path}">Check OK</result>
        else
                <result path="{$path}">WARNING: Resource not bound</result>

Apologies for the long message, I just wanted to make things clear.

Thanks,
_Van
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general