Interesting, I'll have a closer look!

2015-07-30 15:00 GMT+02:00 Geert Josten <[email protected]>:

>  Would this help? It takes tasks from a shared database, but runs from a
> schedule, so I think it would spread across the cluster. I haven’t thoroughly
> tested it in a cluster environment, though.
>
>  https://github.com/grtjn/ml-queue
>
>  Cheers,
> Geert
>
>   From: <[email protected]> on behalf of Andreas
> Hubmer <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Thursday, July 30, 2015 at 2:47 PM
> To: MarkLogic Developer Discussion <[email protected]>
> Subject: Re: [MarkLogic Dev General] Distributing Tasks
>
>   Yes, this is exactly one way my library could work: setting up a
> one-time scheduled task.
> The drawback of scheduled tasks is that the task code needs to be in the
> modules database, so I would always have to deploy it first.
>
>
> 2015-07-30 14:36 GMT+02:00 David Lee <[email protected]>:
>
>>  This is what led me to believe spawn **may** work across nodes:
>>
>> https://docs.marklogic.com/guide/admin/scheduling_tasks
>>
>>
>>
>>
>>
>> 10. In the Task User and Task Host fields, specify the user with
>> permission to invoke the task and the host computer on which the task is to
>> be invoked. *If no host is specified, then the task runs on all hosts.*
>>
>>
>>
>> So, given your feedback, this appears to be a feature of the task 'scheduler'.
>>
>> For your case that may be easy to use.
>>
>> If the task just queries for documents in forests on that host, then the
>> same task can run on all hosts without change.
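>>
>> A minimal sketch of such a host-local task, assuming XML documents and an
>> enabled URI lexicon (xdmp:host(), xdmp:host-forests(), cts:uris(), and
>> xdmp:node-insert-child() are real built-ins; the <processed/> element is
>> made up for illustration):
>>
>> ```xquery
>> (: restrict the URI lookup to forests attached to the host running this :)
>> let $local-forests := xdmp:host-forests(xdmp:host())
>> for $uri in cts:uris((), (), cts:true-query(), (), $local-forests)
>> return xdmp:node-insert-child(fn:doc($uri)/*, <processed/>)
>> ```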
>>
>> -----------------------------------------------------------------------------
>>
>> David Lee
>> Lead Engineer
>> *Mark**Logic* Corporation
>> [email protected]
>> Phone: +1 812-482-5224
>>
>> Cell:  +1 812-630-7622
>> www.marklogic.com
>>
>>
>>
>> *From:*[email protected] [mailto:
>> [email protected]] *On Behalf Of *Andreas Hubmer
>> *Sent:* Thursday, July 30, 2015 8:31 AM
>>
>> *To:* MarkLogic Developer Discussion <[email protected]>
>> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>>
>>
>>
>> David, thanks for your response.
>>
>>
>>
>>  It's true that my tasks are often quite simple, and then I/O is indeed the
>> bottleneck. But with distribution I could speed up the processing by the
>> number of available nodes.
>>
>>
>>
>> I am currently using taskbot for batch processing, and I start it on every
>> node manually. On each node I only process the documents that are local to
>> that node (all my nodes are combined E- and D-nodes).
>>
>>
>>
>> In fact I need a tool/library that can run the taskbot-starting code on
>> every node.
>>
>>
>>
>> A nice-to-have feature is the possibility to check the status of all
>> nodes and find out when they have finished or if any error has happened.
>> This can be done with documents in the database.
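>>
>> For example, each node could write a per-host status document when it
>> finishes (the /status/ URI prefix below is made up; xdmp:host-name(),
>> xdmp:host(), and xdmp:document-insert() are real built-ins):
>>
>> ```xquery
>> (: record this host's completion status in the database :)
>> let $host-name := xdmp:host-name(xdmp:host())
>> return xdmp:document-insert(
>>   "/status/" || $host-name || ".xml",
>>   <status host="{$host-name}" finished="{fn:current-dateTime()}"/>)
>> ```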
>>
>>
>>
>> The task queue is not shared by all nodes in a group. Each node has its
>> own queue. I guess that is because it would be complicated to execute
>> things like anonymous functions (for example when using
>> xdmp:spawn-function) on other hosts (thinking of the function context...).
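>>
>> To illustrate (xdmp:spawn-function() is the real built-in; the logged
>> message is only illustrative):
>>
>> ```xquery
>> (: the anonymous function goes on the task queue of the host that
>>    evaluates this call; it is not shipped to other hosts :)
>> xdmp:spawn-function(function() {
>>   xdmp:log("task running on " || xdmp:host-name(xdmp:host()))
>> })
>> ```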
>>
>>
>>
>> Regards,
>>
>> Andreas
>>
>>
>>
>> 2015-07-30 14:00 GMT+02:00 David Lee <[email protected]>:
>>
>>  This is an interesting use case. Most of the external distributed-processing
>> tools (like mlcp) are designed around initial ingestion.
>>
>> That is a case where, if carefully distributed, the work can scale much
>> better than with a simplistic approach. However, even in ingestion only very
>> specific cases will be improved this way versus much simpler concepts like
>> batching and multithreading. Even for ingestion it's not always desirable
>> to 'overthink' the server rather than let it manage the document
>> distribution across forests itself.
>>
>>
>>
>> But bulk processing of the nature you're describing (adding an element
>> to every document) may not benefit much from distributing the task
>> load across servers. The user-mode (XQuery/JS) CPU and memory overhead
>> may be very low compared to the I/O.
>>
>> It's conceivable (but I don't know whether it's implemented) that something
>> like the following actually executes on the 'd-node' containing the document.
>>
>>
>>
>> https://docs.marklogic.com/xdmp:node-insert-after
>>
>>
>>
>> xdmp:node-insert-after(doc("/example.xml")/a/b,
>>     <c>ccc</c>)
>>
>>
>>
>> If it did, then it wouldn't make much difference at all where it was
>> executed; a few threads at once doing this (via CoRB or manually)
>> should be able to load the system to capacity efficiently.
>>
>> If it pulls the document to the calling node, there's additional
>> overhead when it's remote, but often not as much as people think.
>>
>> The latency between hosts on a good network can be smaller than the
>> latency to disk. And at the point you become I/O bound, the battle is over:
>> you can send data back and forth like a tennis match and it won't matter.
>> Then re-indexing and merging are going to kick in, and the 'minor' work of
>> inserting a node will not be the main contributing factor in the total time.
>>
>>
>>
>> Also, considering that the task queue is shared by all nodes in the group,
>> and xdmp:spawn and xdmp:spawn-function make use of the task queue, I believe
>> (need to check) that they will also make use of all hosts.
>>
>>
>>
>> So either way, as long as you get some parallelism, the load should spread
>> fairly well and get close to the ideal maximum if you use the lowest-level
>> document update calls possible, and take care not to keep transactions and
>> locks open as a side effect of searching for the documents.
>>
>>
>>
>> It's worth trying a simple approach first before attempting to optimize.
>>
>> Then a basic 'split into parallel batches' should get about as close to the
>> theoretical maximum as possible. If you don't use an existing tool, I find
>> it's both easier overall, and easier to avoid an unexpected locking problem,
>> to pre-calculate the URIs of all the documents, split that list, and store
>> it (in the DB or on the filesystem). Then launch your batches, giving each
>> its list of URIs.
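>>
>> A minimal sketch of that pattern, assuming an enabled URI lexicon, XML
>> documents, and a made-up batch size of 500 (cts:uris(),
>> xdmp:spawn-function(), and xdmp:node-insert-child() are real built-ins):
>>
>> ```xquery
>> (: pre-calculate all URIs once, then spawn one task per batch;
>>    each spawned function captures its own slice of the list :)
>> let $uris := cts:uris((), (), cts:true-query())
>> let $size := 500
>> for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $size))
>> let $batch := fn:subsequence($uris, ($i - 1) * $size + 1, $size)
>> return xdmp:spawn-function(function() {
>>   for $uri in $batch
>>   return xdmp:node-insert-child(fn:doc($uri)/*, <c>ccc</c>)
>> })
>> ```
>>
>> Storing the batches first, as suggested, also makes it easy to re-run a
>> failed batch later.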
>>
>> Very well-written tools can do better than this, but it's trickier than it
>> seems to iterate over the URIs and create batches all at once without
>> running into some kind of locking or scheduling problem.
>>
>>
>>
>>
>>
>> *From:*[email protected] [mailto:
>> [email protected]] *On Behalf Of *Geert Josten
>> *Sent:* Thursday, July 30, 2015 4:49 AM
>>
>>
>> *To:* MarkLogic Developer Discussion <[email protected]>
>> *Subject:* Re: [MarkLogic Dev General] Distributing Tasks
>>
>>
>>
>> Hi Andreas,
>>
>>
>>
>> Interesting slides, good find!
>>
>>
>>
>> If you are talking about more ad hoc processing, you could look into
>> things like https://github.com/mblakele/taskbot and
>> https://github.com/marklogic/corb2. These are tools that can batch up
>> the work very well. They won’t spread load across a cluster automatically,
>> though. You could, however, try to split the load somehow and run multiple
>> instances in parallel, each against a different host. That works best if
>> you are targeting the host that actually holds the data you want to touch,
>> but that is difficult. MLCP does that with its -fastload option. Would
>> MLCP's copy feature with a transform perhaps work?
>>
>>
>>
>> MarkLogic also provides Hadoop integration, so maybe that is also worth
>> looking at?
>>
>>
>>
>> Cheers,
>>
>> Geert
>>
>>
>>
>> *From: *<[email protected]> on behalf of Andreas
>> Hubmer <[email protected]>
>> *Reply-To: *MarkLogic Developer Discussion <
>> [email protected]>
>> *Date: *Thursday, July 30, 2015 at 8:56 AM
>> *To: *MarkLogic Developer Discussion <[email protected]>
>> *Subject: *Re: [MarkLogic Dev General] Distributing Tasks
>>
>>
>>
>> Hi Geert,
>>
>>
>>
>> Thanks for the update.
>>
>>
>>
>> Triggers and the CPF aren't exactly what I'm looking for. What I want to
>> do is to distribute one-time tasks like adding new elements to all existing
>> documents.
>>
>>
>>
>> I've found some slides
>> <http://developer.marklogic.com/media/mlw12/Distributed-Content-Processing-in-MarkLogic.pdf>
>> from a MarkLogic consultant on "Distributed Content Processing in
>> MarkLogic", but the code builds on ML 4.
>>
>>
>> I'll probably create a lightweight library myself, either using one-time
>> scheduled tasks or an HTTP server for distributing the tasks.
>>
>>
>>
>> Regards,
>>
>> Andreas
>>
>>
>>
>> 2015-07-29 17:56 GMT+02:00 Geert Josten <[email protected]>:
>>
>>  Hi Andreas,
>>
>>
>>
>> I haven’t heard about anything in this direction recently, but FWIW I
>> added a +1 to the RFE.
>>
>>
>>
>> Could post-commit triggers, or CPF, help out in some way? They should run
>> on the host that holds the forest containing the document at hand, from
>> what I have heard.
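>>
>> A hedged sketch of creating such a post-commit trigger with the Triggers
>> library (run against the triggers database; the trigger name, directory
>> scope, and module path below are all made up):
>>
>> ```xquery
>> xquery version "1.0-ml";
>> import module namespace trgr = "http://marklogic.com/xdmp/triggers"
>>   at "/MarkLogic/triggers.xqy";
>>
>> (: fire a made-up module after any document is created under /docs/ :)
>> trgr:create-trigger(
>>   "add-element-on-create", "example post-commit trigger",
>>   trgr:trigger-data-event(
>>     trgr:directory-scope("/docs/", "infinity"),
>>     trgr:document-content("create"),
>>     trgr:post-commit()),
>>   trgr:trigger-module(xdmp:modules-database(), "/", "add-element.xqy"),
>>   fn:true(), xdmp:default-permissions())
>> ```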
>>
>>
>>
>> Cheers,
>>
>> Geert
>>
>>
>>
>>
>>
>> *From: *<[email protected]> on behalf of Andreas
>> Hubmer <[email protected]>
>> *Reply-To: *MarkLogic Developer Discussion <
>> [email protected]>
>> *Date: *Tuesday, July 28, 2015 at 5:20 PM
>> *To: *MarkLogic Developer Discussion <[email protected]>
>> *Subject: *[MarkLogic Dev General] Distributing Tasks
>>
>>
>>
>> Hello,
>>
>>
>>
>> In this Knowledgebase article
>> <https://help.marklogic.com/knowledgebase/article/View/112/0/techniques-for-dividing-tasks-between-hosts-in-a-cluster>
>> there is mention of an RFE (2763) that would make it possible to pass
>> options to xdmp:spawn() to allow the execution of code on a specific host
>> in a cluster.
>>
>> Are there still any plans for this feature?
>>
>>
>>
>> Thanks and cheers,
>>
>> Andreas
>>
>>
>>
>> --
>>
>> Andreas Hubmer
>>
>> IT Consultant
>>
>>
>>
>>
>> _______________________________________________
>> General mailing list
>> [email protected]
>> Manage your subscription at:
>> http://developer.marklogic.com/mailman/listinfo/general
>>
>>
>>
>>
>>
>> --
>>
>> Andreas Hubmer
>>
>> IT Consultant
>>
>>
>>
>> EBCONT enterprise technologies GmbH
>>
>> Millennium Tower
>>
>> Handelskai 94-96
>>
>> A-1200 Vienna
>>
>>
>>
>> Mobile: +43 664 60651861
>>
>> Fax: +43 2772 512 69-9
>>
>> Email: [email protected]
>>
>> Web: http://www.ebcont.com
>>
>>
>>
>> OUR TEAM IS YOUR SUCCESS
>>
>>
>>
>> UID-Nr. ATU68135644
>>
>> HG St.Pölten - FN 399978 d
>>
>>
>>
>
>


-- 
Andreas Hubmer
IT Consultant
