Re: [MarkLogic Dev General] Prioritizing entries in the taskserver queue

Tim Meagher Tue, 27 Jul 2010 18:23:10 -0700

Hi Danny,


At this time I don't have the luxury of using another host.  We're looking
into clustering, but we're not quite there yet.  Even so, the whole notion of
prioritizing document processing across a variety of applications is going
to be challenging.

 

As you noted, by the time things get to CPF it's too late, but I think I'll
change my CPF action queries so that instead of performing the document
processing, they will merely supply the requested action to a set of
prioritized queues.  The questions then become: 1) what is the most
efficient way to build a queue, to fill it with processing instructions, and
to extract the instructions from it, and 2) how to set up and trigger a
dispatcher that maximizes the use of the task server threads while
extracting instructions from the queues in priority order?  (The assumption
is that the CPF actions should be fairly quick in relation to the actual
processing of each document, and that processing does not starve the CPF
actions of the chance to fill the priority queues.)

 

I don't think it makes sense to use a single document for each queue;
instead I think I need either a dedicated directory URI for each priority
queue or a dedicated collection for each queue within a single directory
URI.  I like using directory URIs just because it's easier to access them via
WebDAV if necessary.  I wonder if indexing the documents based on time of
creation and priority would be useful for quickly identifying which document
of the same priority should be processed next.  I figure I'll use tail
recursion to find the next document to process.
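
The ordering described above (priority first, then time of creation within
the same priority) can be modelled in a few lines.  This is only a sketch in
Python, not MarkLogic code: the class name PriorityWorkQueue, the methods
enqueue and dequeue, the example URIs, and the convention that a lower number
means a higher priority are all assumptions made for illustration, and an
in-memory counter stands in for the creation timestamp.

```python
import heapq
import itertools


class PriorityWorkQueue:
    """In-memory stand-in for a set of prioritized instruction queues."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used in place of a creation timestamp so that
        # entries of equal priority come out in insertion (FIFO) order.
        self._seq = itertools.count()

    def enqueue(self, priority, doc_uri, action):
        # Assumption: a lower number means a higher priority.
        heapq.heappush(self._heap, (priority, next(self._seq), doc_uri, action))

    def dequeue(self):
        # Returns None when there is nothing left to process.
        if not self._heap:
            return None
        priority, _, doc_uri, action = heapq.heappop(self._heap)
        return priority, doc_uri, action

    def __len__(self):
        return len(self._heap)
```

In a database-backed version the same effect would presumably come from
sorting queue entries on priority and creation time, which is what the
indexing question above is getting at.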

 

As for the dispatcher, the easiest thing to do is to have a single
dispatcher, which might be useful as each process consumes a single web
resource; however, allowing multiple dispatchers, up to the number of task
server threads (with the number of dispatchers configurable to tune
performance), will probably increase performance.  The use of multiple
dispatchers adds complexity because coordination would be required to avoid
redundant processing.
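
To make the coordination problem concrete, here is a rough Python sketch (not
MarkLogic code) in which several dispatchers pull from one shared,
thread-safe priority queue, so each instruction is claimed by exactly one
dispatcher.  The function name run_dispatchers, its parameters, and the
lower-number-is-higher-priority convention are assumptions for illustration;
in MarkLogic the claim step would need some equivalent atomic operation on
the queue entries.

```python
import itertools
import queue
import threading


def run_dispatchers(instructions, num_dispatchers, process):
    """Process (priority, payload) pairs with several cooperating dispatchers.

    Returns the results in the order in which they were completed.
    """
    work = queue.PriorityQueue()
    seq = itertools.count()  # tie-breaker: FIFO within the same priority
    for priority, payload in instructions:
        work.put((priority, next(seq), payload))

    results = []
    results_lock = threading.Lock()

    def dispatcher():
        while True:
            try:
                # get_nowait() removes the entry atomically, so no other
                # dispatcher can claim the same instruction.
                _, _, payload = work.get_nowait()
            except queue.Empty:
                return
            outcome = process(payload)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=dispatcher) for _ in range(num_dispatchers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With num_dispatchers set to 1 this reduces to the single-dispatcher case and
items complete strictly in priority order; with more dispatchers the order of
completion is only approximately by priority, but nothing is processed twice.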

 

I suppose you see where I'm going with this and you and/or others can
provide further suggestions based on experience and a better understanding
of the guts of MarkLogic.

 

Regards,

 

Tim

 

  _____  

From: [email protected]
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Tuesday, July 27, 2010 6:11 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Prioritizing entries in the taskserver
queue

 

Hi Tim,

 

I don't think there is any way to de-prioritize the order of something on
the task server queue once it is already spawned.  If you wanted to do that,
it would have to be before it is spawned.

 

What you might be able to do (and I think you hinted at this in your
question) is to use a different host to spawn the tasks to.  The host that a
task is spawned to is the same host in which the query is evaluated (the
e-node), so you can try to send higher priority tasks to a different (and
less used) e-node.  I am not sure what the best way to do this is, and I
would guess that would depend on your application.  It could be as simple as
having some dispatcher code somewhere that looks at the priority (your
application would have to supply this) and then redirects the query to
another server.  Or you could do it in a load balancer or proxy forwarder.
By the time it gets to CPF, however, it is probably too late, so this would
have to come before the CPF event is triggered.

 

I don't know of another way to do this, as there is no API to remove or
reorder items in the task server queue.

 

-Danny

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Monday, July 26, 2010 4:13 AM
To: 'General Mark Logic Developer Discussion'
Subject: [MarkLogic Dev General] Prioritizing entries in the task server
queue

 

Hi Folks,

 

I have a workflow for processing documents of various priorities using the
Content Processing Framework.  The problem I'm running into is that I might
get 10,000 documents that need to be processed at a low priority which get
submitted to the task server queue, but then maybe 10 documents come in that
are of a higher priority (I'm using these counts for purposes of
discussion).  What I would like to be able to do is to insert the 10 high
priority items in the queue so that they are processed before any
outstanding low priority items in the task server queue; in other words, I
want to interrupt FIFO processing.  I'm not concerned about the high
priority processing starving low priority processing as the volume of the
high priority items is relatively low, but nonetheless an elegant solution
would allow me to fine-tune the process so that low priority starvation does
not occur.
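
One simple way to get that kind of tuning is to let a low priority item
through after a fixed number of consecutive high priority items.  The Python
sketch below illustrates only the selection rule; the function name drain and
the high_burst parameter are hypothetical, and two plain FIFO queues stand in
for whatever storage the real queues would use.

```python
from collections import deque


def drain(high, low, high_burst=3):
    """Yield items from two FIFO queues, preferring `high`.

    After `high_burst` consecutive high priority items, one low priority
    item is let through (if any is waiting) so that `low` is never starved.
    """
    high, low = deque(high), deque(low)
    streak = 0  # consecutive high priority items served so far
    while high or low:
        take_low = bool(low) and (not high or streak >= high_burst)
        if take_low:
            streak = 0
            yield low.popleft()
        else:
            streak += 1
            yield high.popleft()
```

A larger high_burst favours the high priority queue more strongly; a value of
1 alternates between the two queues whenever both have work waiting.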

 

There was some previous discussion about using tail-recursion with
xdmp:spawn.  That way I would hopefully be able to select the next document
to process based on its relative priority.  In that case I would probably
want to revise the CPF process to merely fill customized priority queues,
e.g. high, mid, and low priority queues and to use tail recursion to examine
the queues and decide which document to process next.

 

I get the impression that clustering could be a useful way to create task
servers that are dedicated to higher and lower priority processing for the
needs of an entire organization, but it seems to me that allowing for
pre-emption in a given task server could be a really useful feature.

 

Perhaps there are some existing features that are provided to deal with just
this problem.  There are times when I've submitted more docs to be processed
by the task server and would like to be able to dequeue them - I suppose
that a prioritization solution would also allow for dequeuing tasks.

 

Thanks ahead of time for any help!

 

Tim Meagher

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
