Hi Ivan,
It might not fit your problem, and I haven't tested it, but this might be
slightly quicker:
if (doc(concat($path, 'resource.xml'))//binary-resource = $uri) then
If you are processing over 2 million URIs, then using an asynchronous process
might not be a bad idea at all. Use xdmp:spawn instead of xdmp:invoke and let
the query write the result to the document properties instead of returning it
as XML. That saves gathering 2 million check results at the same time.
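
Something along these lines, perhaps. It is an untested sketch in 1.0-ml syntax, and the check-status property name is just something I made up. It also assumes that only one URI from your list maps to each directory, since a second one would overwrite the property of the first:

```xquery
xquery version "1.0-ml";

(: analyze.xqy, meant to be spawned once per URI from the driver, e.g.:
   xdmp:spawn("/analyze.xqy", (xs:QName("uri"), $uri)) :)
declare variable $uri as xs:string external;

(: "content/repository001/data/store001/location001/file.dat"
   becomes "/data/store001/location001/" :)
let $path :=
  fn:concat(
    "/",
    fn:string-join(
      fn:tokenize($uri, "/")[fn:position() = 3 to fn:last() - 1],
      "/"
    ),
    "/"
  )
let $doc-uri := fn:concat($path, "resource.xml")
let $status :=
  if (fn:doc($doc-uri)//binary-resource = $uri) then "ok" else "warning"
return
  (: check-status is a hypothetical property name; pick your own :)
  xdmp:document-set-property(
    $doc-uri,
    <check-status>{$status}</check-status>
  )
```

As far as I know the task server queue has a size limit, so you would probably have to feed the 2 million spawns in batches rather than all at once.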
Then write another query that analyses the document properties and returns
counts (using xdmp:estimate) of OKs and warnings.
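
The counting query could look roughly like this. Again it is untested, and it assumes the results were stored in a property element called check-status holding "ok" or "warning", which is my own naming and not anything built in:

```xquery
xquery version "1.0-ml";

(: Counts documents by the value of the hypothetical check-status
   property. xdmp:estimate resolves the count from the indexes
   without fetching the documents themselves. :)
<results>
  <ok>{
    xdmp:estimate(
      cts:search(
        fn:doc(),
        cts:properties-query(
          cts:element-value-query(xs:QName("check-status"), "ok")
        )
      )
    )
  }</ok>
  <warning>{
    xdmp:estimate(
      cts:search(
        fn:doc(),
        cts:properties-query(
          cts:element-value-query(xs:QName("check-status"), "warning")
        )
      )
    )
  }</warning>
</results>
```

Because the counts come from the indexes, this should stay fast even with millions of documents, and you can run it while the spawned tasks are still working to see progress.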
HTH,
Geert
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Baranov, Ivan - Moscow
> Sent: Tuesday, 18 August 2009 8:15
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] RE: Processing huge sequences
>
> Geert,
>
> The task is to go through a list of string values and perform
> a simple operation for each of them. More precisely: I have
> about 2,000,000 URIs which I received as a plain text
> document and then turned into XML by means of Perl. Each of
> them has the following structure:
>
> content/repository001/data/store001/location001/file.dat
>
> and represents a path to a binary resource which is located
> in some remote data repository (nothing to do with MarkLogic).
>
> At the same time, /data/store001/location001/ is a directory
> on my MarkLogic server where resource.xml file can be found.
> In that file there is a node <binary-resource> which must
> contain binary resource URI, so its value is similar to what
> was described above:
>
> content/repository001/data/store001/location001/file.dat
>
> What I need is to go over all of 2,000,000 URIs in my list
> and check if some of them are not referenced in the
> appropriate XML instances on MarkLogic, i.e. analyze.xqy does
> the following:
>
> define variable $uri as xs:string external
> (: $uri =
> "content/repository001/data/store001/location001/file.dat" :)
>
> let $path :=
> fn:concat(
> "/",
> fn:string-join(
> fn:tokenize($uri, "/")[3 to fn:last()-1],
> "/"
> ),
> "/"
> )
> (: $path = "/data/store001/location001/" :)
>
> return
> if (xdmp:directory($path, "1")//binary-resource[1] = $uri) then
> (: Checking reference :)
> <result path="{$path}">Check OK</result>
> else
> <result path="{$path}">WARNING: Resource not bound</result>
>
> Apologies for the long message, I just wanted to make things clear.
>
> Thanks,
> _Van
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Geert Josten
> Sent: Monday, August 17, 2009 6:26 PM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] RE: Processing huge sequences
>
> Hi Ivan,
>
> Can you describe in more functional terms what you are trying
> to do? I have the impression that there should be smarter
> ways of tackling your problem. Do you really need this
> items.xml for instance? Wouldn't it be possible to use a
> cts:search in MarkLogic Server to compose this XML dynamically?
>
> And analyze.xqy taking about 400 sec to perform: if it
> concerns only lookups and not too much calculation work, it
> sounds like a lot as well.
>
> Have you considered taking an asynchronous approach? You can
> use xdmp:spawn for that or utilize the Content Processing Framework.
>
> Kind regards,
> Geert
>
> >
>
>
> Drs. G.P.H. Josten
> Consultant
>
>
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> http://www.daidalos.nl/
> KvK 27164984
> The information sent in or with this email message originates
> from Daidalos BV and is intended solely for the addressee. If
> you have received this message unintentionally, we request
> that you delete it. No rights can be derived from this
> message.
>
>
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of
> > Baranov, Ivan - Moscow
> > Sent: Monday, 17 August 2009 15:51
> > To: General Mark Logic Developer Discussion
> > Subject: [MarkLogic Dev General] Processing huge sequences
> >
> > Hi All,
> >
> > I'm experiencing problems when processing long sequences.
> > E.g. there is one XML file which has following structure:
> >
> > items.xml
> > ---------
> >
> > <root>
> > <item id="/data/store001/location001/"/>
> > <item id="/data/store001/location012/"/>
> > <item id="/data/store003/location006/"/>
> > .
> > .
> > .
> > <item id="/data/store115/location322/"/>
> > </root>
> >
> > Where fn:count(//item) = 15,000. For each of them I must perform a
> > simple operation involving xdmp:directory(@id, "1") call. Say, some
> > node check. So, what I do next is I write two XQuery queries using
> > xdmp:invoke() method.
> >
> > main.xqy
> > --------
> >
> > let $items := fn:doc("/items.xml")
> > return
> > <results>
> > {
> > for $item in $items//item
> > return
> > try {
> > xdmp:invoke("/analyze.xqy",
> > (xs:QName("item"), fn:string($item/@id)),
> > <options xmlns="xdmp:eval">
> > <isolation>different-transaction</isolation>
> > <prevent-deadlocks>true</prevent-deadlocks>
> > </options>
> > )
> > }
> > }
> > catch ($ex) {
> > $ex
> > }
> > }
> > </results>
> >
> > analyze.xqy does some xdmp:directory() stuff for each item.
> > It takes approx. 400s or something for this script set to perform
> > the task, which is a good result. Cool.
> >
> > BUT - when I tried to go through the larger list which included
> > 2,000,000 items, I even failed to upload it via WebDAV. After
> > cutting it into pieces of 100,000 items each, I managed to upload
> > them but then failed to get the results.
> > After two hours of waiting ML threw an exception saying that the
> > timeout limit was exceeded.
> >
> > I would be very thankful if someone could help me out with this or
> > give me some advice.
> >
> > Thanks,
> > Van
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://xqzone.com/mailman/listinfo/general
> >
>