Thanks, Paul and David!

Simply using xdmp:spawn-function() did not help much, since I ran into the 
"maximum tasks" limit when naively spawning a function for each document.

But CoRB worked beautifully and it's definitely parallelizing the job just the 
way I wanted.
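
For reference, a CoRB job for this kind of update boils down to two small 
modules, roughly like the sketch below (module and collection names are 
placeholders, and the URIs module assumes the URI lexicon is enabled). The 
URIs module returns a count followed by the URIs to process:

(: uris.xqy - select the documents to process :)
xquery version "1.0-ml";
let $uris := cts:uris((), (), cts:collection-query("my-collection"))
return (fn:count($uris), $uris)

The transform module is then invoked once per URI, which CoRB passes in via 
the external variable $URI:

(: transform.xqy - run once per document URI :)
xquery version "1.0-ml";
declare variable $URI as xs:string external;

let $x := fn:doc($URI)/A
where fn:exists($x)
return (
  if ($x/test) then xdmp:node-delete($x/test) else (),
  xdmp:node-insert-child($x, <test>test</test>)
)

The number of worker threads is a CoRB option, which is where the 
parallelization comes from.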

I am also going to take a look at Taskbot, which seems really useful.

Thanks again!

Alexei Betin

Principal Architect; Big Data
P: (817) 928-1643 | Elevate.com<http://www.elevate.com>
4150 International Plaza, Suite 300
Fort Worth, TX 76109


From: [email protected] 
[mailto:[email protected]] On Behalf Of Paul Hoehne
Sent: Thursday, January 15, 2015 10:50 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic

There's also the CoRB facility.  I would try Taskbot, but if you're not 
familiar with it, I would also try to do a version of it using 
xdmp:spawn-function. xdmp:spawn-function is a sometimes overlooked but 
extremely useful function, and it is well worth learning.
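
For example, something along these lines spawns a small update onto the task 
server (just a sketch; the document URI is made up):

xdmp:spawn-function(
  function() {
    (: runs as a separate task on the task server :)
    let $x := fn:doc("/example/doc-1.xml")/A
    return (
      if ($x/test) then xdmp:node-delete($x/test) else (),
      xdmp:node-insert-child($x, <test>test</test>)
    )
  },
  (: mark the spawned task as an update :)
  <options xmlns="xdmp:eval">
    <update>true</update>
  </options>
)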

Paul Hoehne
Senior Consultant
MarkLogic Corporation
[email protected]<mailto:[email protected]>
mobile: +1 571 830 4735
www.marklogic.com<http://www.marklogic.com>

Click http://po.st/hMGDFm to get your free NoSQL For Dummies e-book!

From: David Ennis <[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Thursday, January 15, 2015 at 1:43 PM
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] Bulk content processing in MarkLogic

Hi.

I usually spawn these types of things in batches.
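
Roughly along these lines, for example (just a sketch: the collection name and 
batch size are made up, and cts:uris needs the URI lexicon enabled):

xquery version "1.0-ml";

let $uris := cts:uris((), (), cts:collection-query("my-collection"))
let $batch-size := 500
for $i in 1 to xs:int(fn:ceiling(fn:count($uris) div $batch-size))
let $batch :=
  $uris[position() = (($i - 1) * $batch-size + 1) to $i * $batch-size]
return
  (: one task per batch, so the task queue stays well below its limit :)
  xdmp:spawn-function(
    function() {
      for $x in fn:doc($batch)/A
      return (
        if ($x/test) then xdmp:node-delete($x/test) else (),
        xdmp:node-insert-child($x, <test>test</test>)
      )
    },
    <options xmlns="xdmp:eval">
      <update>true</update>
    </options>
  )

Each spawned task runs as its own transaction, so a failed batch does not roll 
back the others.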

There is also a nice utility by Michael Blakeley out there to help manage this 
and make good use of the resources of your particular setup, including a nice 
sample to start with:

https://github.com/mblakele/taskbot

It uses pretty much the same functions you would likely use on your own - but 
organized nicely in a reusable/configurable way.


Kind Regards,
David Ennis


David Ennis
Content Engineer

HintTech <http://www.hinttech.com/>
Mastering the value of content
creative | technology | content

Delftechpark 37i
2628 XJ Delft
The Netherlands
T: +31 88 268 25 00
M: +31 63 091 72 80


On 15 January 2015 at 19:33, Alexei Betin 
<[email protected]<mailto:[email protected]>> wrote:
Hello,

I stumbled upon what seems to be a straightforward task: making a bulk 
modification to XML documents in MarkLogic (such as adding a new element to 
every document in a collection). I looked at CPF first, but it seems like it 
only supports event-based processing (triggers) and does not have any facility 
for batch processing.

So I just wrote a simple XQuery as follows:

for $x in collection()/A
return (
  xdmp:node-delete($x/test),
  xdmp:node-insert-child($x, <test>test</test>)
)

But it runs out of memory on a large collection ("Expanded tree cache full"). 
So it looks like the above query is trying to fetch all documents into memory 
first, then iterate over them.

What I want instead is to perform the work on smaller chunks of data that fit 
into memory and, ideally, to process several such chunks in parallel (think 
"map" without "reduce").

Is there another approach? I am reading about CoRB, which seems to be just the 
thing I need, but I wonder if I am missing another potential solution here.

Also, while the CoRB description mentions that it can run updates on disk (not 
in memory), it does not mention parallelization, which will eventually be 
quite important for my use case.

Thanks,

Alexei Betin

Principal Architect; Big Data
P: (817) 928-1643 | Elevate.com<http://www.elevate.com>
4150 International Plaza, Suite 300
Fort Worth, TX 76109


