Yes, this is exactly one way my library could work: setting up a one-time scheduled task on every host. The drawback of scheduled tasks is that the task code needs to be in the modules database, so I would always have to deploy it first.
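
A minimal sketch of that approach, assuming the Admin API functions admin:group-one-time-scheduled-task and admin:group-add-scheduled-task behave as documented. The module path, group name, database name, user name and the two-minute delay are hypothetical placeholders, and the task module must already be deployed to the modules database (the drawback mentioned above):

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Hypothetical values: adjust to the actual deployment. :)
declare variable $TASK-MODULE := "/tasks/start-taskbot.xqy";
declare variable $GROUP-NAME  := "Default";
declare variable $CONTENT-DB  := "Documents";
declare variable $TASK-USER   := "admin";

let $config   := admin:get-configuration()
let $group-id := admin:group-get-id($config, $GROUP-NAME)
(: One-time tasks need a start time in the future; two minutes is an assumption. :)
let $start    := fn:current-dateTime() + xs:dayTimeDuration("PT2M")
(: Build one task per host in the group, each pinned to that host. :)
let $tasks :=
  for $host-id in xdmp:group-hosts($group-id)
  return admin:group-one-time-scheduled-task(
    $TASK-MODULE,
    $start,
    xdmp:database($CONTENT-DB),
    xdmp:modules-database(),
    xdmp:user($TASK-USER),
    $host-id)
(: Saving the configuration requires admin privileges. :)
return admin:save-configuration(
  admin:group-add-scheduled-task($config, $group-id, $tasks))
```

Pinning each task to an explicit host keeps the sketch independent of how an empty Task Host is interpreted. Each task module could then process only the documents in forests local to its host and record its status in a document in the database.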

2015-07-30 14:36 GMT+02:00 David Lee:

> This is what led me to believe spawn **may** work across nodes:
>
> https://docs.marklogic.com/guide/admin/scheduling_tasks
>
>     10. In the Task User and Task Host fields, specify the user with
>     permission to invoke the task and the host computer on which the task
>     is to be invoked. *If no host is specified, then the task runs on all
>     hosts.*
>
> So given your feedback, this appears to be a feature of the task
> 'scheduler'. For your case that may be easy to use. If the task just
> queries for documents on forests on that host, then the same task can run
> on all hosts without change.
>
> -----------------------------------------------------------------------------
> David Lee
> Lead Engineer
> *Mark**Logic* Corporation
> [email protected]
> Phone: +1 812-482-5224
> Cell: +1 812-630-7622
> www.marklogic.com
>
> *From:* [email protected] *On Behalf Of* Andreas Hubmer
> *Sent:* Thursday, July 30, 2015 8:31 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>
> David, thanks for your response.
>
> It's true that my tasks are often quite simple, and then IO is indeed the
> bottleneck. But by distributing the work I could speed up the processing
> by the number of available nodes.
>
> I am currently using Taskbot for batch processing, and I start it on every
> node manually. On each node I only process the documents that are local to
> that node (all my nodes are E and D nodes).
>
> In fact I need a tool/library that can run the Taskbot-starting code on
> every node.
>
> A nice-to-have feature is the possibility to check the status of all nodes
> and find out when they have finished or whether any error has happened.
> This can be done with documents in the database.
>
> The task queue is not shared by all nodes in a group. Each node has its
> own queue. I guess that is because it would be complicated to execute
> things like anonymous functions (for example when using
> xdmp:spawn-function) on other hosts (thinking of the function context...).
>
> Regards,
> Andreas
>
> 2015-07-30 14:00 GMT+02:00 David Lee:
>
> This is an interesting use case. Many of the external distributed
> processing tools (like mlcp) are designed around initial ingestion. That
> is a case that, if carefully distributed, can scale much better than a
> simplistic approach. However, even in ingestion there are only very
> specific cases that will be improved in this way compared with much
> simpler basic concepts like batching and multithreading. Even for
> ingestion it's not always desirable to 'over think' the server rather than
> let it manage the document distribution across forests itself.
>
> But bulk processing of the nature you're describing (adding an element to
> every document) may not benefit much from distributing the task load
> across servers. The user-mode (xquery/js) CPU and memory overhead may be
> very low compared to the IO.
>
> It's conceivable (but I don't know if it's implemented) that something
> like the following actually executes on the 'd-node' containing the
> document.
>
> https://docs.marklogic.com/xdmp:node-insert-after
>
>     xdmp:node-insert-after(doc("/example.xml")/a/b,
>       <c>ccc</c>);
>
> If it did, then it wouldn't make much difference at all where it was
> executed; a few threads at once doing this (via corb or manually) should
> be able to load the system to capacity and efficiency.
>
> If it pulls the document to the calling node, there's an additional
> overhead if it's remote, but often not as much as people think. The
> latency between hosts on a good network can be smaller than the latency to
> disk. And at the point you become IO bound the battle is over; you can
> send data back and forth like a tennis match and it won't matter. Then
> re-indexing and merging are going to kick in, and the 'minor' work of
> inserting a node will not be the main contributing factor in the total
> time.
>
> Also, considering that the task queue is shared by all nodes in the group,
> and xdmp:spawn and xdmp:spawn-function make use of the task queue, I
> believe (need to check) that it will also make use of all hosts.
>
> So either way, as long as you get some parallelism, the load should spread
> fairly well and get close to the ideal maximum if you use the lowest-level
> document update calls possible and take care not to keep transactions and
> locks open as a side effect of searching for the documents.
>
> It's worth trying a simple approach first before attempting to optimize.
> Then a basic 'split into parallel batches' should get about as close to
> the theoretical maximum as possible. If you don't use an existing tool, I
> find it's both easier overall, and easier to avoid hitting an unexpected
> locking problem, to pre-calculate the URIs of all the documents, split
> that list, and store it (in the DB or on the filesystem). Then launch your
> batches, giving each one its list of URIs.
>
> Very well written tools can do better than this, but it's trickier than it
> seems to iterate over the URIs and create batches all at once without
> running into some kind of locking or scheduling problem.
>
> *From:* [email protected] *On Behalf Of* Geert Josten
> *Sent:* Thursday, July 30, 2015 4:49 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>
> Hi Andreas,
>
> Interesting slides, good find!
>
> If you are talking about more ad hoc processing, you could look into
> things like https://github.com/mblakele/taskbot and
> https://github.com/marklogic/corb2. These are tools that can batch up the
> work very well. They won’t spread load across a cluster automatically,
> though. You could, however, try to split the load somehow and run multiple
> instances in parallel, each against a different host. That works best if
> you are targeting the host that actually holds the data you want to touch,
> but that is difficult. MLCP does that with its -fastload option. Would the
> MLCP copy feature with a transform perhaps work?
>
> MarkLogic also provides Hadoop integration, so maybe that is also worth
> looking at?
>
> Cheers,
> Geert
>
> *From:* [email protected] on behalf of Andreas Hubmer
> *Reply-To:* MarkLogic Developer Discussion
> *Date:* Thursday, July 30, 2015 at 8:56 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>
> Hi Geert,
>
> Thanks for the update.
>
> Triggers and the CPF aren't exactly what I'm looking for. What I want to
> do is to distribute one-time tasks like adding new elements to all
> existing documents.
>
> I've found some slides
> <http://developer.marklogic.com/media/mlw12/Distributed-Content-Processing-in-MarkLogic.pdf>
> from a ML consultant on "Distributed Content Processing in MarkLogic", but
> the code builds on ML 4.
>
> Probably I'll create a lightweight library myself, using either one-time
> scheduled tasks or an HTTP server for distributing the tasks.
>
> Regards,
> Andreas
>
> 2015-07-29 17:56 GMT+02:00 Geert Josten:
>
> Hi Andreas,
>
> I haven’t heard about anything in this direction recently, but FWIW I
> added a +1 to the RFE.
>
> Could post-commit triggers or CPF help out in some way? From what I have
> heard, they should run on the host that holds the forest that holds the
> document at hand.
>
> Cheers,
> Geert
>
> *From:* [email protected] on behalf of Andreas Hubmer
> *Reply-To:* MarkLogic Developer Discussion
> *Date:* Tuesday, July 28, 2015 at 5:20 PM
> *To:* MarkLogic Developer Discussion
> *Subject:* [MarkLogic Dev General] Distributing Tasks
>
> Hello,
>
> In this Knowledgebase article
> <https://help.marklogic.com/knowledgebase/article/View/112/0/techniques-for-dividing-tasks-between-hosts-in-a-cluster>
> there is talk about an RFE (2763) that would make it possible to pass
> options into xdmp:spawn() to allow the execution of code on a specific
> host in a cluster.
>
> Are there still any plans for this feature?
>
> Thanks and cheers,
> Andreas
>
> --
> Andreas Hubmer
> IT Consultant
>
> _______________________________________________
> General mailing list
> [email protected]
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
>
> --
> Andreas Hubmer
> IT Consultant
>
> EBCONT enterprise technologies GmbH
> Millennium Tower
> Handelskai 94-96
> A-1200 Vienna
>
> Mobile: +43 664 60651861
> Fax: +43 2772 512 69-9
> Email: [email protected]
> Web: http://www.ebcont.com
>
> OUR TEAM IS YOUR SUCCESS
>
> UID-Nr. ATU68135644
> HG St.Pölten - FN 399978 d

--
Andreas Hubmer
IT Consultant
