Re: [MarkLogic Dev General] Distributing Tasks

Andreas Hubmer Thu, 30 Jul 2015 05:48:08 -0700

Yes, this is exactly one way my library could work: setting up a one-time
scheduled task.
The drawback of scheduled tasks is that the task code needs to be in the
modules database, so I would always have to deploy it first.



2015-07-30 14:36 GMT+02:00 David Lee <[email protected]>:

>  This is what led me to believe spawn **may** work across nodes
>
> https://docs.marklogic.com/guide/admin/scheduling_tasks
>
>
>
>
>
> 10. In the Task User and Task Host fields, specify the user with
> permission to invoke the task and the host computer on which the task is to
> be invoked. *If no host is specified, then the task runs on all hosts.*
>
>
>
> So given your feedback, this appears to be a feature of the task 'scheduler'.
> For your case that may be easy to use.
>
> If the task just queries for documents on forests on that host, then the
> same task can run on all hosts without change.
>
>
>
>
>
>
>
>
> -----------------------------------------------------------------------------
>
> David Lee
> Lead Engineer
> *Mark**Logic* Corporation
> [email protected]
> Phone: +1 812-482-5224
>
> Cell:  +1 812-630-7622
> www.marklogic.com
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *Andreas Hubmer
> *Sent:* Thursday, July 30, 2015 8:31 AM
>
> *To:* MarkLogic Developer Discussion <[email protected]>
> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>
>
>
> David, thanks for your response.
>
>
>
> It's true that my tasks are often quite simple, and then IO is indeed
> the bottleneck. But by distributing the work I could speed up the
> processing by a factor of the number of available nodes.
>
>
>
> I am using Taskbot for batch-processing currently and I start it on every
> node manually. On each node I only process the documents that are local to
> that node (all my nodes are E and D nodes).
>
>
>
> In fact I need a tool/library that can run the taskbot-starting code on
> every node.
>
>
>
> A nice-to-have feature is the possibility to check the status of all nodes
> and find out when they have finished or if any error has happened. This can
> be done with documents in the database.
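
The status check described above can be sketched independently of MarkLogic. This is a minimal illustration in Python, assuming each node writes one status record with a hypothetical value of "running", "finished", or "error"; the function name `summarize` and the host names are made up for the example, and reading the records from the database is left out.

```python
from typing import Dict


def summarize(statuses: Dict[str, str]) -> str:
    """Combine per-host status records into one overall status.

    `statuses` maps a host name to "running", "finished" or "error".
    These values are assumptions for this sketch, not a MarkLogic format.
    """
    if not statuses:
        # No node has reported yet, so the job cannot be considered done.
        return "running"
    if any(status == "error" for status in statuses.values()):
        return "error"
    if all(status == "finished" for status in statuses.values()):
        return "finished"
    return "running"


if __name__ == "__main__":
    print(summarize({"host-1": "finished", "host-2": "running"}))
```

An error on any node takes precedence here, so a caller polling the overall status can stop waiting as soon as one node fails.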
>
>
>
> The task queue is not shared by all nodes in a group. Each node has its
> own queue. I guess that is because it would be complicated to execute
> things like anonymous functions (for example when using
> xdmp:spawn-function) on other hosts (thinking of the function context...).
>
>
>
> Regards,
>
> Andreas
>
>
>
> 2015-07-30 14:00 GMT+02:00 David Lee <[email protected]>:
>
>   This is an interesting use case. Many of the external distributed-processing
> tools (like mlcp) are designed around initial ingestion.
>
> That is a case that, if carefully distributed, can scale much better than a
> simplistic approach. However, even in ingestion only very specific cases
> are improved this way compared with much simpler basic concepts like
> batching and multithreading. Even for ingestion it's not always desirable
> to 'over think' the server rather than let it manage the document
> distribution across forests itself.
>
>
>
> But bulk processing of the nature you're describing (adding an element
> to every document) may not benefit much from distributing the task
> load across servers. The user-mode (xquery/js) CPU and memory overhead
> may be very low compared to the IO.
>
> It's conceivable (but I don't know if it's implemented) that something like
> the following actually executes on the 'd-node' containing the document.
>
>
>
> https://docs.marklogic.com/xdmp:node-insert-after
>
>
>
> xdmp:node-insert-after(doc("/example.xml")/a/b,
>     <c>ccc</c>);
>
>
>
> If it did, then it wouldn't make much difference at all where it was
> executed; a few threads at once doing this (via corb or manually)
> should be able to load the system to capacity and efficiency.
>
> If it pulls the document to the calling node, there's an additional
> overhead if it's remote, but often not as much as people think.
> The latency between hosts on a good network can be smaller than the
> latency to disk. And at the point you become IO bound the battle is over:
> you can send data back and forth like a tennis match and it won't matter.
> Then re-indexing and merging are going to kick in, and the 'minor' work of
> inserting a node will not be the main contributing factor in the total time.
>
>
>
> Also, considering that the task queue is shared by all nodes in the group
> and that xdmp:spawn and xdmp:spawn-function make use of the task queue, I
> believe (need to check) that it will also make use of all hosts.
>
>
>
> So either way, as long as you get some parallelism, the load should spread
> fairly well and get close to the ideal maximum if you use the
> lowest-level document update calls possible and take care not to keep
> transactions and locks open as a side effect of searching for the
> documents.
>
>
>
> It's worth trying a simple approach first before attempting to optimize.
>
> Then a basic 'split into parallel batches' should get about as close to
> the theoretical maximum as possible. If you don't use an existing tool, I find
> it's both easier overall, and easier to avoid hitting an unexpected locking
> problem, to pre-calculate the URIs of all the documents, split that list, and
> store it (in the DB or on the filesystem). Then launch your batches, giving
> each its list of URIs.
>
> Very well written tools can do better than this, but it's trickier than
> it seems to iterate over the URIs and create batches all at once without
> running into some kind of locking or scheduling problem.
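
The "pre-calculate the URIs, split the list, launch the batches" idea above can be sketched without any MarkLogic calls. This is a minimal illustration in Python, not taken from taskbot, corb, or mlcp; `split_into_batches` and the sample URIs are hypothetical, and collecting the URIs and launching each batch are deliberately left out.

```python
from typing import Iterator, List, Sequence


def split_into_batches(uris: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of a pre-computed URI list.

    The list is fixed before any update starts, so creating the batches
    does not depend on a query that runs while documents are being changed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(uris), batch_size):
        yield list(uris[start:start + batch_size])


if __name__ == "__main__":
    # Hypothetical URIs; in practice the list would be read from wherever
    # it was stored (a document in the database or a file).
    sample = ["/docs/1.xml", "/docs/2.xml", "/docs/3.xml"]
    for batch in split_into_batches(sample, 2):
        print(batch)
```

Each batch can then be handed to a separate worker; the last batch is simply shorter when the list length is not a multiple of the batch size.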
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *Geert Josten
> *Sent:* Thursday, July 30, 2015 4:49 AM
>
>
> *To:* MarkLogic Developer Discussion <[email protected]>
> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>
>
>
> Hi Andreas,
>
>
>
> Interesting slides, good find!
>
>
>
> If you are talking about more ad hoc processing, you could look into
> things like https://github.com/mblakele/taskbot, and
> https://github.com/marklogic/corb2. These are tools that can batch up the
> work very well. They won’t spread load across a cluster automatically
> though. You could however try to split the load somehow, and run multiple
> instances in parallel, each against a different host. Though, that works
> best if you are targeting the host that actually holds the data you want to
> touch. But that is difficult. MLCP does that with its -fastload option.
> Would the MLCP copy feature with a transform perhaps work?
>
>
>
> MarkLogic also provides Hadoop integration, so maybe that is also worth
> looking at?
>
>
>
> Cheers,
>
> Geert
>
>
>
> *From: *<[email protected]> on behalf of Andreas
> Hubmer <[email protected]>
> *Reply-To: *MarkLogic Developer Discussion <
> [email protected]>
> *Date: *Thursday, July 30, 2015 at 8:56 AM
> *To: *MarkLogic Developer Discussion <[email protected]>
> *Subject: *Re: [MarkLogic Dev General] Distributing Tasks
>
>
>
> Hi Geert,
>
>
>
> Thanks for the update.
>
>
>
> Triggers and the CPF aren't exactly what I'm looking for. What I want to
> do is to distribute one-time tasks like adding new elements to all existing
> documents.
>
>
>
> I've found some slides
> <http://developer.marklogic.com/media/mlw12/Distributed-Content-Processing-in-MarkLogic.pdf>
> from a ML consultant on "Distributed Content Processing in MarkLogic", but the
> code builds on ML 4.
>
>
> Probably I'll create a lightweight library myself. Either using one-time
> scheduled tasks or an HTTP server for distributing the tasks.
>
>
>
> Regards,
>
> Andreas
>
>
>
> 2015-07-29 17:56 GMT+02:00 Geert Josten <[email protected]>:
>
>   Hi Andreas,
>
>
>
> I haven’t heard about anything in this direction recently, but FWIW I
> added a +1 to the RFE.
>
>
>
> Could post-commit triggers or CPF help out in some way? From what I have
> heard, they should run on the host that holds the forest that holds the
> document at hand.
>
>
>
> Cheers,
>
> Geert
>
>
>
>
>
> *From: *<[email protected]> on behalf of Andreas
> Hubmer <[email protected]>
> *Reply-To: *MarkLogic Developer Discussion <
> [email protected]>
> *Date: *Tuesday, July 28, 2015 at 5:20 PM
> *To: *MarkLogic Developer Discussion <[email protected]>
> *Subject: *[MarkLogic Dev General] Distributing Tasks
>
>
>
> Hello,
>
>
>
> In this Knowledgebase article
> <https://help.marklogic.com/knowledgebase/article/View/112/0/techniques-for-dividing-tasks-between-hosts-in-a-cluster>
> there is talk about an RFE (2763) that would make it possible to pass in
> options into xdmp:spawn() to allow the execution of code on a specific host
> in a cluster.
>
> Are there still any plans for this feature?
>
>
>
> Thanks and cheers,
>
> Andreas
>
>
>
> --
>
> Andreas Hubmer
>
> IT Consultant
>
>
>
>
> _______________________________________________
> General mailing list
> [email protected]
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
>
>
>
>
>
> --
>
> Andreas Hubmer
>
> IT Consultant
>
>
>
> EBCONT enterprise technologies GmbH
>
> Millennium Tower
>
> Handelskai 94-96
>
> A-1200 Vienna
>
>
>
> Mobile: +43 664 60651861
>
> Fax: +43 2772 512 69-9
>
> Email: [email protected]
>
> Web: http://www.ebcont.com
>
>
>
> OUR TEAM IS YOUR SUCCESS
>
>
>
> UID-Nr. ATU68135644
>
> HG St.Pölten - FN 399978 d
>


-- 
Andreas Hubmer
IT Consultant

EBCONT enterprise technologies GmbH
Millennium Tower
Handelskai 94-96
A-1200 Vienna

Mobile: +43 664 60651861
Fax: +43 2772 512 69-9
Email: [email protected]
Web: http://www.ebcont.com

OUR TEAM IS YOUR SUCCESS

UID-Nr. ATU68135644
HG St.Pölten - FN 399978 d
