Re: [MarkLogic Dev General] Distributing Tasks

Geert Josten Thu, 30 Jul 2015 06:01:00 -0700

Would this help? It takes tasks from a shared database but runs from a 
schedule, so I think it would spread across the cluster. I haven’t thoroughly 
tested it in a cluster environment, though.


https://github.com/grtjn/ml-queue

Cheers,
Geert

From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Thursday, July 30, 2015 at 2:47 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Distributing Tasks

Yes, this is exactly one way my library could work: setting up a one-time 
scheduled task.
The drawback of scheduled tasks is that the task code needs to be in the 
modules database, so I would always have to deploy it first.


2015-07-30 14:36 GMT+02:00 David Lee <[email protected]>:
This is what led me to believe spawn *may* work across nodes:
https://docs.marklogic.com/guide/admin/scheduling_tasks


10. In the Task User and Task Host fields, specify the user with permission to 
invoke the task and the host computer on which the task is to be invoked. If no 
host is specified, then the task runs on all hosts.

So given your feedback, this appears to be a feature of the task scheduler 
rather than of spawn.
For your case that may be easy to use.
If the task only queries for documents in forests on its own host, then the 
same task can run on all hosts without change.
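
A minimal sketch of that idea, written in Python as a simulation rather than as MarkLogic code. The host names, forest names, and both lookup tables are hypothetical stand-ins; in a real task module the forests on the current host would be looked up through the server's own APIs. It only illustrates why one unchanged task body, invoked once per host, covers every document exactly once when each invocation restricts itself to local forests.

```python
from typing import Dict, List

# Hypothetical cluster layout: which forests live on which host.
FORESTS_BY_HOST: Dict[str, List[str]] = {
    "host-a": ["forest-1", "forest-2"],
    "host-b": ["forest-3"],
}

# Hypothetical document placement: which URIs are stored in which forest.
URIS_BY_FOREST: Dict[str, List[str]] = {
    "forest-1": ["/doc/1.xml", "/doc/4.xml"],
    "forest-2": ["/doc/2.xml"],
    "forest-3": ["/doc/3.xml", "/doc/5.xml"],
}


def local_uris(host: str) -> List[str]:
    """Return only the URIs stored in forests that live on `host`."""
    uris: List[str] = []
    for forest in FORESTS_BY_HOST.get(host, []):
        uris.extend(URIS_BY_FOREST.get(forest, []))
    return sorted(uris)


def run_task_on_all_hosts() -> Dict[str, List[str]]:
    """Simulate the same task body being invoked once on every host."""
    return {host: local_uris(host) for host in FORESTS_BY_HOST}
```

Because every forest belongs to exactly one host, the per-host URI lists do not overlap and together cover all documents.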



-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com

From: [email protected] On Behalf Of Andreas Hubmer
Sent: Thursday, July 30, 2015 8:31 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Distributing Tasks

David, thanks for your response.

It's true that my tasks are often quite simple, and then I/O is indeed the 
bottleneck. But by distributing the work I could speed up the processing by a 
factor of the number of available nodes.

I am using Taskbot for batch-processing currently and I start it on every node 
manually. On each node I only process the documents that are local to that node 
(all my nodes are E and D nodes).

In fact I need a tool/library that can run the taskbot-starting code on every 
node.

A nice-to-have feature is the possibility to check the status of all nodes and 
find out when they have finished or if any error has happened. This can be done 
with documents in the database.
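
A rough sketch of how such status tracking could be combined into one overall answer, in Python and purely illustrative. The `NodeStatus` shape, the three state names, and the dictionary standing in for status documents in the database are all assumptions, not part of Taskbot or MarkLogic.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class NodeStatus:
    """Stand-in for one status document written by one node (hypothetical shape)."""
    host: str
    state: str  # one of "running", "finished", "failed"
    error: Optional[str] = None


def overall_state(expected_hosts: Iterable[str],
                  statuses: Dict[str, NodeStatus]) -> str:
    """Combine the per-node status records into one state for the whole job."""
    records = [statuses.get(host) for host in expected_hosts]
    # Any failure makes the whole job failed, even if other nodes still run.
    if any(r is not None and r.state == "failed" for r in records):
        return "failed"
    # A node with no record yet, or one still running, keeps the job running.
    if any(r is None or r.state != "finished" for r in records):
        return "running"
    return "finished"
```

A node that has not written a record yet counts as still running, so the job is only reported as finished once every expected host has reported in.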

The task queue is not shared by all nodes in a group. Each node has its own 
queue. I guess that is because it would be complicated to execute things like 
anonymous functions (for example when using xdmp:spawn-function) on other hosts 
(thinking of the function context...).

Regards,
Andreas

2015-07-30 14:00 GMT+02:00 David Lee <[email protected]>:
This is an interesting use case. Most of the external distributed-processing 
tools (like mlcp) are designed around initial ingestion.
That is a case that, if carefully distributed, can scale much better than a 
simplistic approach. However, even in ingestion only very specific cases are 
improved this way compared with much simpler concepts like batching and 
multithreading. Even for ingestion it's not always desirable to 'overthink' 
the server rather than let it manage the document distribution across forests 
itself.

But bulk processing of the nature you're describing (adding an element to 
every document) may not benefit much from distributing the task load across 
servers. The user-mode (XQuery/JS) CPU and memory overhead may be very low 
compared to the I/O.
It's conceivable (but I don't know if it's implemented) that something like 
the following actually executes on the 'd-node' containing the document.

https://docs.marklogic.com/xdmp:node-insert-after

xdmp:node-insert-after(doc("/example.xml")/a/b,
    <c>ccc</c>);

If it did, then it wouldn't make much difference at all where it was executed; 
a few threads doing this at once (via corb or manually) should be able to load 
the system to capacity, and efficiently.
If it pulls the document to the calling node, there's additional overhead when 
the document is remote, but often not as much as people think.
The latency between hosts on a good network can be smaller than the latency to 
disk. And once you are I/O bound the battle is over: 
you can send data back and forth like a tennis match and it won't matter. 
Then re-indexing and merging will kick in, and the 'minor' work of 
inserting a node will not be the main contributing factor in the total time.

Also, considering that the task queue is shared by all nodes in the group, and 
that xdmp:spawn and xdmp:spawn-function make use of the task queue, I believe 
(need to check) that they will also make use of all hosts.

So either way, as long as you get some parallelism, the load should spread 
fairly well and get close to the ideal maximum if you use the lowest-level 
document update calls possible and take care not to keep transactions and 
locks open as a side effect of searching for the documents.

It's worth trying a simple approach first before attempting to optimize.
A basic 'split into parallel batches' should then get about as close to the 
theoretical maximum as possible. If you don't use an existing tool, I find it's 
both easier overall, and easier to avoid hitting an unexpected locking problem, 
to pre-calculate the URIs of all the documents, split that list, and store it 
(in the DB or on the filesystem). Then launch your batches, giving each its 
list of URIs.
Very well written tools can do better than this, but it's trickier than it 
seems to iterate over the URIs and create batches all at once without running 
into some kind of locking or scheduling problem.









From: [email protected] On Behalf Of Geert Josten
Sent: Thursday, July 30, 2015 4:49 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Distributing Tasks

Hi Andreas,

Interesting slides, good find!

If you are talking about more ad hoc processing, you could look into things 
like https://github.com/mblakele/taskbot and 
https://github.com/marklogic/corb2. These are tools that can batch up the work 
very well. They won’t spread load across a cluster automatically, though. You 
could, however, try to split the load somehow and run multiple instances in 
parallel, each against a different host. That works best if you are 
targeting the host that actually holds the data you want to touch, which is 
difficult. MLCP does that with its -fastload option. Would the MLCP copy 
feature with a transform perhaps work?

MarkLogic also provides Hadoop integration, so maybe that is also worth looking 
at?

Cheers,
Geert

From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Thursday, July 30, 2015 at 8:56 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Distributing Tasks

Hi Geert,

Thanks for the update.

Triggers and the CPF aren't exactly what I'm looking for. What I want to do is 
to distribute one-time tasks like adding new elements to all existing documents.

I've found some slides 
<http://developer.marklogic.com/media/mlw12/Distributed-Content-Processing-in-MarkLogic.pdf> 
from an ML consultant on "Distributed Content Processing in MarkLogic", but the 
code builds on ML 4.

Probably I'll create a lightweight library myself. Either using one-time 
scheduled tasks or an HTTP server for distributing the tasks.

Regards,
Andreas

2015-07-29 17:56 GMT+02:00 Geert Josten <[email protected]>:
Hi Andreas,

I haven’t heard about anything in this direction recently, but FWIW I added a 
+1 to the RFE.

Could post-commit triggers or CPF help out in some way? From what I have 
heard, they should run on the host that holds the forest containing the 
document at hand.

Cheers,
Geert


From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, July 28, 2015 at 5:20 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: [MarkLogic Dev General] Distributing Tasks

Hello,

This Knowledgebase article 
<https://help.marklogic.com/knowledgebase/article/View/112/0/techniques-for-dividing-tasks-between-hosts-in-a-cluster> 
mentions an RFE (2763) that would make it possible to pass options to 
xdmp:spawn() so that the code executes on a specific host in a cluster.
Are there still any plans for this feature?

Thanks and cheers,
Andreas

--
Andreas Hubmer
IT Consultant


_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general



--
Andreas Hubmer
IT Consultant

EBCONT enterprise technologies GmbH
Millennium Tower
Handelskai 94-96
A-1200 Vienna

Mobile: +43 664 60651861
Fax: +43 2772 512 69-9
Email: [email protected]
Web: http://www.ebcont.com

OUR TEAM IS YOUR SUCCESS

UID-Nr. ATU68135644
HG St.Pölten - FN 399978 d
