RE: [MarkLogic Dev General] cpf question

Danny Sokolsky Thu, 03 Dec 2009 10:00:38 -0800

You could always put an xdmp:sleep in your action code and/or have the module 
perform its logic a few times before giving up.  Again, this may or may not be 
a problem, depending on your workload.  For example, if the condition returns 
false, you can have it sleep for a minute before calling cpf:failure.
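
The retry-then-give-up idea above can be sketched outside MarkLogic. This is only an illustration in Python, not action-module code: check_with_retries, is_ready, attempts, and delay_seconds are hypothetical names, and in a real action module the pause would be xdmp:sleep (which takes milliseconds) and the final False would correspond to calling cpf:failure.

```python
import time
from typing import Callable


def check_with_retries(
    is_ready: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as is_ready() succeeds, else False after all attempts."""
    for _ in range(attempts):
        if is_ready():
            return True
        # Pause after every failed check, including the last one, so the
        # caller only reports failure after the delay has elapsed.
        sleep(delay_seconds)
    return False
```

The sleep function is passed in so the logic can be exercised without actually waiting. A sleeping action presumably still ties up whatever is running it for the whole delay, which is the workload caveat mentioned above.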


There is no guarantee about the task server queue ordering.  I believe new 
queries are put at the end of the queue, but that does not mean that they will 
evaluate after the others are complete.  Also, if you have cpf running on 
multiple hosts in a cluster, there will be multiple queues.  So I would not 
count on the order of the task server queue.

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mike Sokolov
Sent: Wednesday, December 02, 2009 7:10 PM
To: 'General Mark Logic Developer Discussion'
Subject: RE: [MarkLogic Dev General] cpf question

I think your idea of looping until ready could be useful, but it does sound
like it could churn through a lot of property updates very fast: some of
these documents may be waiting for an hour or more for their dependencies to
be satisfied.  Is it possible to put a document to sleep for a while so that
the re-checking doesn't happen right away?  Or is this likely to happen
anyway due to putting it at the end of the task server queue?  Is there any
sort of guarantee (or at least reliable heuristic) about the queue ordering?

-Mike

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Danny Sokolsky
> Sent: Wednesday, December 02, 2009 7:46 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] cpf question
> 
> 
> Hi Mike,
> 
> One thing you can think about is to create a pipeline that 
> has some steps that just check your documents to see if they 
> are ready to process.  If they are not ready, put them in a 
> state that checks them again.  If they are ready, put them in 
> a state that actually does the processing.  One way to do 
> this in cpf is by adding two state transitions, the first one 
> to check if it is ready, and another to check-again.  So you 
> have a logical flow something like:
> 
> is-ready
>      on-success -> perform-update
>      on-failure -> check-again
> check-again
>      on-success -> perform-update
>      on-failure -> is-ready
> perform-update
>      on-success -> final
>      on-failure -> error
> 
> The advantage here is that the step to do the is-ready is 
> relatively inexpensive, as it does not do any updates to your 
> documents (although cpf will update the document's properties 
> fragment).  If you have a test in is-ready that always
> returns false, however, you will have an infinite loop.  You
> can guard against that by recording an attempt count in a
> document property, then checking that property and giving up
> after a certain number of attempts.
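
The flow above, with the give-up guard, can be simulated in a few lines. This is a rough Python model and not a cpf pipeline definition: the state names come from the flow above, while run_pipeline, max_checks, and the in-memory counter (standing in for a document property) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# state -> (next state on success, next state on failure)
TRANSITIONS: Dict[str, Tuple[str, str]] = {
    "is-ready": ("perform-update", "check-again"),
    "check-again": ("perform-update", "is-ready"),
    "perform-update": ("final", "error"),
}


def run_pipeline(
    is_ready: Callable[[], bool],
    perform_update: Callable[[], bool],
    max_checks: int = 10,
) -> List[str]:
    """Walk the state flow and return the list of states visited."""
    state = "is-ready"
    checks = 0  # stands in for an attempt count kept in a document property
    path = [state]
    while state in TRANSITIONS:
        if state == "perform-update":
            ok = perform_update()
        else:
            checks += 1
            if checks > max_checks:
                # Guard against a readiness test that never succeeds.
                path.append("error")
                break
            ok = is_ready()
        on_success, on_failure = TRANSITIONS[state]
        state = on_success if ok else on_failure
        path.append(state)
    return path
```

With a readiness test that fails twice and then succeeds, the path is is-ready, check-again, is-ready, perform-update, final. With a test that always fails, the walk ends in error once max_checks is exceeded instead of looping forever.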
> 
> You have to find the right balance for your application 
> between having more pipeline steps and having each step do 
> too much.  I tend to find it easier to start with what makes 
> logical sense and then optimize it later, if it is needed.
> 
> -Danny
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Mike Sokolov
> Sent: Wednesday, December 02, 2009 12:01 PM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] cpf question
> 
> I've been working on a cpf pipeline that needs to resolve
> cross-references between a large number of documents.  Basically, we
> want to merge some text from the referenced document into the
> referring document.  I wonder if folks would be able to share good
> ideas for a strategy here.
> 
> The challenge with a multi-threaded loading pipeline like cpf is that
> you don't know whether the referenced document is available yet.
> There are two main ideas we're working with at the moment:
> 
> 1) a two-pass strategy where you load *all* the documents and then
> resolve all the references.  This is the most straightforward and (in
> theory) requires 2N updates, but I can't see how to trigger all the
> updates in cpf for the second pass: maybe I'm just being thick, but it
> seems to me that there will come a time when you need to go and update
> all the documents (another time: make it 3N) just to set their state
> and trigger phase 2 processing.  And this seems to defeat the whole
> purpose of cpf, since you then need to build a list of all documents
> in order to retrigger phase 2.
> 
> 2) a one-pass bidirectional strategy in which each document pulls
> content from its resolvable references and pushes its content into
> documents that reference it.  This is completely order-independent,
> but it results in more updates in a bad case (i.e. lots of cross
> references).  I think that if the average number of references in a
> given document is M, then you get something like N + MN/2 updates if
> the references are distributed evenly.  So if M > 2, this will result
> in more updates than strategy 1: potentially a *lot* more, if M is
> large.
> 
> As an aside: No matter what M is, if all the xrefs are in the last
> document, this takes only N updates.  If they're all in the first one,
> then you get 2N updates.
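
The update counts for the two strategies can be checked with a small model. This Python sketch rests on assumptions that are not stated in the thread: each document costs one update when it loads (pulling from whatever is already there), and each reference to a document that loads later costs one extra update when that document pushes its content back. The function names are hypothetical.

```python
from typing import Dict, List


def two_pass_updates(n: int) -> int:
    """Strategy 1: load everything, then resolve everything."""
    return 2 * n


def one_pass_estimate(n: int, m: float) -> float:
    """Strategy 2 estimate: about half of the M*N references point forward."""
    return n + m * n / 2


def one_pass_updates(load_order: List[str], refs: Dict[str, List[str]]) -> int:
    """Count updates for strategy 2 given a load order and distinct references."""
    position = {doc: i for i, doc in enumerate(load_order)}
    updates = len(load_order)  # one update per document at load time
    for doc, targets in refs.items():
        for target in targets:
            # The target loads after the referrer, so it must push its
            # content into the referrer later: one extra update.
            if position[target] > position[doc]:
                updates += 1
    return updates
```

Under this model, putting all the xrefs in the last document gives exactly N updates, and putting them all in the first document (referencing every other document) gives 2N - 1, which agrees with the roughly 2N figure above. The estimate equals the two-pass cost at M = 2 and grows linearly in M beyond that.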
> 
> I guess the questions are, then:
> 
> What's the best way to implement a two-pass processing pipeline in
> cpf?
> 
> Is there some other approach that I haven't thought of?
> 
> -Mike
> _______________________________________________
> General mailing list
> [email protected] 
> http://xqzone.com/mailman/listinfo/general
> 

