Interesting, I'll have a closer look!

2015-07-30 15:00 GMT+02:00 Geert Josten <[email protected]>:
> Would this help? It takes tasks from a shared database, but runs from a schedule, so I think it would spread across the cluster. Haven’t thoroughly tested it in a cluster environment, though.
>
> https://github.com/grtjn/ml-queue
>
> Cheers,
> Geert
>
> From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Thursday, July 30, 2015 at 2:47 PM
> To: MarkLogic Developer Discussion <[email protected]>
> Subject: Re: [MarkLogic Dev General] Distributing Tasks
>
> Yes, this is exactly one way my library could work: setting up a one-time scheduled task.
> The drawback of scheduled tasks is that the task code needs to be in the modules database, so I would always have to deploy it first.
>
> 2015-07-30 14:36 GMT+02:00 David Lee <[email protected]>:
>
>> This is what led me to believe spawn **may** work across nodes:
>>
>> https://docs.marklogic.com/guide/admin/scheduling_tasks
>>
>> 10. In the Task User and Task Host fields, specify the user with permission to invoke the task and the host computer on which the task is to be invoked. *If no host is specified, then the task runs on all hosts.*
>>
>> So, given your feedback, this appears to be a feature of the task 'scheduler'.
>> For your case that may be easy to use.
>> If the task just queries for documents on forests on that host, then the same task can run on all hosts without change.
>>
>> -----------------------------------------------------------------------------
>> David Lee
>> Lead Engineer
>> MarkLogic Corporation
>> [email protected]
>> Phone: +1 812-482-5224
>> Cell: +1 812-630-7622
>> www.marklogic.com
>>
>> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Andreas Hubmer
>> *Sent:* Thursday, July 30, 2015 8:31 AM
>> *To:* MarkLogic Developer Discussion <[email protected]>
>> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>>
>> David, thanks for your response.
>>
>> It's true that my tasks are often quite simple, and then IO is indeed the bottleneck. But by distributing the work I could speed up the processing by a factor of the number of available nodes.
>>
>> I am currently using Taskbot for batch processing, and I start it on every node manually. On each node I only process the documents that are local to that node (all my nodes are E and D nodes).
>>
>> In fact, I need a tool/library that can run the Taskbot-starting code on every node.
>>
>> A nice-to-have feature is the ability to check the status of all nodes and find out when they have finished or whether any error has happened. This can be done with documents in the database.
>>
>> The task queue is not shared by all nodes in a group. Each node has its own queue. I guess that is because it would be complicated to execute things like anonymous functions (for example when using xdmp:spawn-function) on other hosts (thinking of the function context...).
>>
>> Regards,
>> Andreas
>>
>> 2015-07-30 14:00 GMT+02:00 David Lee <[email protected]>:
>>
>> This is an interesting use case. Many of the external distributed-processing tools (like mlcp) are designed around initial ingestion.
>> That is a case that, if carefully distributed, can scale much better than a simplistic approach.
>> However, even in ingestion there are only very specific cases that will be improved in this way versus much simpler basic concepts like batching and multithreading. Even for ingestion it's not always desirable to 'overthink' the server rather than let it manage the document distribution across forests itself.
>>
>> But bulk processing of the nature you're describing (adding an element to every document) may not benefit much from distributing the task load across servers. The user-mode (xquery/js) CPU and memory overhead may be very low compared to the IO.
>> It's conceivable (but I don't know if it's implemented) that something like the following actually executes on the 'd-node' containing the document.
>>
>> https://docs.marklogic.com/xdmp:node-insert-after
>>
>> xdmp:node-insert-after(doc("/example.xml")/a/b,
>>   <c>ccc</c>);
>>
>> If it did, then it wouldn't make much difference at all where it was executed; a few threads at once doing this (via corbs or manually ...) should be able to load the system to capacity and efficiency.
>> If it pulls the document to the calling node, there's an additional overhead if it's remote, but often not as much as people think.
>> The latency between hosts on a good network can be smaller than the latency to disk. And once you are IO bound the battle is over; you can send data back and forth like a tennis match and it won't matter. Then re-indexing and merging are going to kick in, and the 'minor' work of inserting a node will not be the main contributing factor in the total time.
>>
>> Also, considering that the task queue is shared by all nodes in the group, and that xdmp:spawn and xdmp:spawn-function make use of the task queue, I believe (need to check) that it will also make use of all hosts.
>>
>> So either way, as long as you get some parallelism, the load should spread fairly well and get close to the ideal maximum if you use the lowest-level document update calls possible, and take care not to keep transactions and locks open as a side effect of searching for the documents.
>>
>> It's worth trying a simple approach first before attempting to optimize.
>> Then a basic 'split into parallel batches' should get about as close to the theoretical maximum as possible. If you don't use an existing tool, I find it's both easier overall, and easier to avoid hitting an unexpected locking problem, to pre-calculate the URIs of all the documents, split that list, and store it (in the DB or on the filesystem). Then launch your batches, giving each one its list of URIs.
>> Very well-written tools can do better than this, but it's trickier than it seems to iterate over the URIs and create batches all at once without running into some kind of locking or scheduling problem.
>>
>> -----------------------------------------------------------------------------
>> David Lee
>>
>> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Geert Josten
>> *Sent:* Thursday, July 30, 2015 4:49 AM
>> *To:* MarkLogic Developer Discussion <[email protected]>
>> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>>
>> Hi Andreas,
>>
>> Interesting slides, good find!
>>
>> If you are talking about more ad hoc processing, you could look into things like https://github.com/mblakele/taskbot and https://github.com/marklogic/corb2. These are tools that can batch up the work very well. They won’t spread load across a cluster automatically, though.
>> You could, however, try to split the load somehow and run multiple instances in parallel, each against a different host. That works best, though, if you are targeting the host that actually holds the data you want to touch, and that is difficult. MLCP does that with its -fastload option. Would the MLCP copy feature with a transform perhaps work?
>>
>> MarkLogic also provides Hadoop integration, so maybe that is also worth looking at?
>>
>> Cheers,
>> Geert
>>
>> *From:* <[email protected]> on behalf of Andreas Hubmer <[email protected]>
>> *Reply-To:* MarkLogic Developer Discussion <[email protected]>
>> *Date:* Thursday, July 30, 2015 at 8:56 AM
>> *To:* MarkLogic Developer Discussion <[email protected]>
>> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>>
>> Hi Geert,
>>
>> Thanks for the update.
>>
>> Triggers and the CPF aren't exactly what I'm looking for. What I want to do is distribute one-time tasks like adding new elements to all existing documents.
>>
>> I've found some slides <http://developer.marklogic.com/media/mlw12/Distributed-Content-Processing-in-MarkLogic.pdf> from a ML consultant on "Distributed Content Processing in MarkLogic", but the code builds on ML 4.
>>
>> Probably I'll create a lightweight library myself, using either one-time scheduled tasks or an HTTP server for distributing the tasks.
>>
>> Regards,
>> Andreas
>>
>> 2015-07-29 17:56 GMT+02:00 Geert Josten <[email protected]>:
>>
>> Hi Andreas,
>>
>> I haven’t heard about anything in this direction recently, but FWIW I added a +1 to the RFE.
>>
>> Could post-commit triggers or CPF help out in some way? From what I have heard, they should run on the host that holds the forest that holds the document at hand.
>>
>> Cheers,
>> Geert
>>
>> *From:* <[email protected]> on behalf of Andreas Hubmer <[email protected]>
>> *Reply-To:* MarkLogic Developer Discussion <[email protected]>
>> *Date:* Tuesday, July 28, 2015 at 5:20 PM
>> *To:* MarkLogic Developer Discussion <[email protected]>
>> *Subject:* [MarkLogic Dev General] Distributing Tasks
>>
>> Hello,
>>
>> In this Knowledgebase article <https://help.marklogic.com/knowledgebase/article/View/112/0/techniques-for-dividing-tasks-between-hosts-in-a-cluster> there is talk about an RFE (2763) that would make it possible to pass options into xdmp:spawn() to allow the execution of code on a specific host in a cluster.
>> Are there still any plans for this feature?
>>
>> Thanks and cheers,
>> Andreas
>>
>> --
>> Andreas Hubmer
>> IT Consultant
>>
>> _______________________________________________
>> General mailing list
>> [email protected]
>> Manage your subscription at:
>> http://developer.marklogic.com/mailman/listinfo/general
>>
>> --
>> Andreas Hubmer
>> IT Consultant
>>
>> EBCONT enterprise technologies GmbH
>> Millennium Tower
>> Handelskai 94-96
>> A-1200 Vienna
>>
>> Mobile: +43 664 60651861
>> Fax: +43 2772 512 69-9
>> Email: [email protected]
>> Web: http://www.ebcont.com
>>
>> OUR TEAM IS YOUR SUCCESS
>>
>> UID-Nr. ATU68135644
>> HG St.Pölten - FN 399978 d

--
Andreas Hubmer
IT Consultant
_______________________________________________ General mailing list [email protected] Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
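
Archive note: David's suggestion earlier in the thread (pre-calculate the URIs of all documents, split that list, store it, then launch one batch per sub-list) can be sketched as below. This is only an illustration in Python, not code from any of the tools mentioned; the function name `split_into_batches`, the batch size, and the example URIs are all hypothetical, and the actual per-batch update would still be done by MarkLogic calls such as the xdmp:node-insert-after example quoted above.

```python
from typing import Iterator, List, Sequence


def split_into_batches(uris: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `uris`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(uris), batch_size):
        yield list(uris[start:start + batch_size])


# Hypothetical, pre-calculated URI list. In practice this would be read from
# the database or from a file written earlier, so that no query (and no lock)
# stays open while the batches are being created.
uris = [f"/docs/{n}.xml" for n in range(1, 8)]

# Each batch would then be handed to one worker or task.
batches = list(split_into_batches(uris, batch_size=3))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Because the list is fixed before any update starts, every URI lands in exactly one batch, which is the property that avoids the locking and scheduling surprises David mentions.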
