Hi Ivan,

It might not fit your problem, and I haven't tested it, but this might be
slightly quicker:

        if (doc(concat($path, 'resource.xml'))//binary-resource = $uri) then

If you are processing over 2 million URIs, then using an asynchronous process
might not be a bad idea at all. Use xdmp:spawn instead of xdmp:invoke and let
the query write its result to the document properties instead of returning it
as XML. That saves gathering 2 million check results at the same time.
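
A rough, untested sketch of what the spawned module could look like, assuming
the 1.0-ml dialect. The property element name check-status is made up for
illustration, and the sketch assumes each resource.xml is checked against only
one URI (a second check on the same document would overwrite the property).
The Task Server queue has a limited size, so 2 million tasks probably need to
be queued in batches rather than in one go.

```xquery
xquery version "1.0-ml";

(: analyze.xqy, queued by the caller with something like:
     xdmp:spawn("/analyze.xqy", (xs:QName("uri"), $uri))
   The variable name "uri" and the <check-status> property are assumptions. :)
declare variable $uri as xs:string external;

(: Same path derivation as in the original analyze.xqy. :)
let $path :=
  fn:concat(
    "/",
    fn:string-join(fn:tokenize($uri, "/")[3 to fn:last() - 1], "/"),
    "/"
  )
let $doc-uri := fn:concat($path, "resource.xml")
let $status :=
  if (fn:doc($doc-uri)//binary-resource = $uri) then "ok" else "warning"
return
  if (fn:exists(fn:doc($doc-uri))) then
    (: Store the outcome on the document instead of returning it. :)
    xdmp:document-set-property($doc-uri, <check-status>{$status}</check-status>)
  else
    (: No resource.xml at all: nothing to attach a property to, so log it. :)
    xdmp:log(fn:concat("No resource.xml found for ", $uri))
```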

Then write another query that analyses the document properties and returns
counts (using xdmp:estimate) of OKs and warnings.
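
Something along these lines might do for the counting, again untested. It
assumes the checks stored a (hypothetical) check-status element with the value
"ok" or "warning" in each document's properties. Note that xdmp:estimate
answers from the indexes only, so the counts are unfiltered.

```xquery
xquery version "1.0-ml";

(: Count documents by the status stored in their properties.
   <check-status> is an assumed property name, not a built-in one. :)
let $ok-query :=
  cts:properties-query(
    cts:element-value-query(xs:QName("check-status"), "ok"))
let $warning-query :=
  cts:properties-query(
    cts:element-value-query(xs:QName("check-status"), "warning"))
return
  <summary>
    <ok>{xdmp:estimate(cts:search(fn:doc(), $ok-query))}</ok>
    <warning>{xdmp:estimate(cts:search(fn:doc(), $warning-query))}</warning>
  </summary>
```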

HTH,
Geert 

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Baranov, Ivan - Moscow
> Sent: Tuesday, 18 August 2009 8:15
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] RE: Processing huge sequences
> 
> Geert,
> 
> The task is to go through a list of string values and perform 
> a simple operation for each of them. More precisely: I have 
> about 2,000,000 URIs which I received as a plain text 
> document and then turned into XML by means of Perl. Each of 
> them has the following structure:
> 
> content/repository001/data/store001/location001/file.dat
> 
> and represents a path to a binary resource which is located 
> in some remote data repository (nothing to do with MarkLogic).
> 
> At the same time, /data/store001/location001/ is a directory 
> on my MarkLogic server where resource.xml file can be found. 
> In that file there is a node <binary-resource> which must 
> contain binary resource URI, so its value is similar to what 
> was described above:
> 
> content/repository001/data/store001/location001/file.dat
> 
> What I need is to go over all of 2,000,000 URIs in my list 
> and check if some of them are not referenced in the 
> appropriate XML instances on MarkLogic, i.e. analyze.xqy does 
> the following:
> 
> define variable $uri as xs:string external
> (: $uri = 
> "content/repository001/data/store001/location001/file.dat" :)
> 
> let $path :=
>       fn:concat(
>               "/",
>               fn:string-join(
>                       fn:tokenize($uri, "/")[3 to fn:last()-1],
>                       "/"
>               ),
>               "/"
>       )
> (: $path = "/data/store001/location001/" :)
> 
> return
>       if (xdmp:directory($path, "1")//binary-resource[1] = $uri) then
>               (: Checking reference :)
>               <result path="{$path}">Check OK</result>
>       else
>               <result path="{$path}">WARNING: Resource not bound</result>
> 
> Apologies for the long message, I just wanted to make things clear.
> 
> Thanks,
> _Van
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Geert Josten
> Sent: Monday, August 17, 2009 6:26 PM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] RE: Processing huge sequences
> 
> Hi Ivan,
> 
> Can you describe in more functional terms what you are trying 
> to do? I have the impression that there should be smarter 
> ways of tackling your problem. Do you really need this 
> items.xml for instance? Wouldn't it be possible to use a 
> cts:search in MarkLogic Server to compose this XML dynamically?
> 
> And analyze.xqy taking about 400 sec to perform: if it 
> concerns only lookups and not too much calculation work, that 
> sounds like a lot as well.
> 
> Have you considered taking an asynchronous approach? You can 
> use xdmp:spawn for that or utilize the Content Processing Framework.
> 
> Kind regards,
> Geert
> 
> >
> 
> 
> Drs. G.P.H. Josten
> Consultant
> 
> 
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> http://www.daidalos.nl/
> KvK 27164984
> The information sent in or with this email message 
> originates from Daidalos BV and is intended solely for the 
> addressee. If you have received this message unintentionally, 
> we ask that you delete it. No rights can be derived from 
> this message.
> 
> 
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Baranov,
> > Ivan - Moscow
> > Sent: Monday, 17 August 2009 15:51
> > To: General Mark Logic Developer Discussion
> > Subject: [MarkLogic Dev General] Processing huge sequences
> >
> > Hi All,
> >
> > I'm experiencing problems when processing long sequences.
> > E.g. there is one XML file which has following structure:
> >
> > items.xml
> > ---------
> >
> > <root>
> >       <item id="/data/store001/location001/"/>
> >       <item id="/data/store001/location012/"/>
> >       <item id="/data/store003/location006/"/>
> >       .
> >       .
> >       .
> >       <item id="/data/store115/location322/"/>
> > </root>
> >
> > Where fn:count(//item) = 15,000. For each of them I must perform a 
> > simple operation involving xdmp:directory(@id, "1") call. Say, some 
> > node check. So, what I do next is I write two XQuery queries using 
> > xdmp:invoke() method.
> >
> > main.xqy
> > --------
> >
> > let $items := fn:doc("/items.xml")
> > return
> >       <results>
> >       {
> >               for $i in $items//item
> >               return
> >             try {
> >                       xdmp:invoke("/analyze.xqy",
> >                               (xs:QName("item"), fn:string($i/@id)),
> >                               <options xmlns="xdmp:eval">
> >                                       <isolation>different-transaction</isolation>
> >                                       <prevent-deadlocks>true</prevent-deadlocks>
> >                               </options>
> >                       )
> >             }
> >             catch ($ex) {
> >                       $ex
> >             }
> >     }
> >     </results>
> >
> > analyze.xqy does some xdmp:directory() stuff for each item.
> > It takes approx. 400s or something for this script set to perform the
> > task, which is a good result. Cool.
> >
> > BUT - when I tried to go through the larger list which included
> > 2,000,000 items, I even failed to upload it via WebDAV. After cutting
> > into pieces each of 100,000 items, I managed to upload them but then
> > failed to get the results.
> > After two hours of waiting ML threw an exception saying that the
> > timeout limit was exceeded.
> >
> > I would be very thankful if someone could help me out with this or 
> > give me some advice.
> >
> > Thanks,
> > Van
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://xqzone.com/mailman/listinfo/general
> >
> 
