Hi Mike,

One thing you could consider is a pipeline with some steps that just check 
whether your documents are ready to process.  If a document is not ready, put 
it in a state that checks it again; if it is ready, put it in a state that 
does the actual processing.  One way to do this in cpf is to add two state 
transitions, one that checks whether the document is ready and another that 
checks again.  That gives you a logical flow something like this, with a 
rough pipeline sketch after it:

is-ready
     on-success -> perform-update
     on-failure  -> check-again
check-again
     on-success -> perform-update
     on-failure -> is-ready
perform-update
     on-success -> final
     on-failure -> error
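
In pipeline terms, that might look roughly like the sketch below.  This is 
untested and leaves out some of the usual boilerplate (the description, the 
default success/failure actions, and whatever status-transition or initial 
state gets a newly loaded document into is-ready in the first place); the 
state URIs and module paths are made up, so substitute your own:

    <pipeline xmlns="http://marklogic.com/cpf/pipelines">
      <pipeline-name>Check readiness before updating</pipeline-name>
      <state-transition>
        <state>http://example.com/states/is-ready</state>
        <on-success>http://example.com/states/perform-update</on-success>
        <on-failure>http://example.com/states/check-again</on-failure>
        <default-action>
          <module>/actions/is-ready.xqy</module>
        </default-action>
      </state-transition>
      <state-transition>
        <state>http://example.com/states/check-again</state>
        <on-success>http://example.com/states/perform-update</on-success>
        <on-failure>http://example.com/states/is-ready</on-failure>
        <default-action>
          <module>/actions/check-again.xqy</module>
        </default-action>
      </state-transition>
      <state-transition>
        <state>http://example.com/states/perform-update</state>
        <on-success>http://example.com/states/final</on-success>
        <on-failure>http://example.com/states/error</on-failure>
        <default-action>
          <module>/actions/perform-update.xqy</module>
        </default-action>
      </state-transition>
    </pipeline>

The perform-update module is where you would do the actual merge; the other 
two modules just test whether the document is ready.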

The advantage here is that the is-ready step is relatively inexpensive, since 
it does not update your documents (although cpf will still update each 
document's properties fragment).  If the test in is-ready always returns 
false, however, you will have an infinite loop.  You can guard against that by 
adding a property of your own, checking it in the action, and giving up after 
a certain number of attempts, as in the sketch below.
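
For example, the is-ready action might be something along these lines.  Again 
untested, and the xref-check-count property, the gave-up state, and 
local:references-resolved() are all names I just made up; the real readiness 
test depends on how your cross-references are stored:

    xquery version "1.0-ml";
    (: is-ready.xqy -- untested sketch of a readiness check with a retry cap :)
    import module namespace cpf = "http://marklogic.com/cpf"
      at "/MarkLogic/cpf/cpf.xqy";

    declare variable $cpf:document-uri as xs:string external;
    declare variable $cpf:transition as node() external;

    (: how many re-checks we tolerate before giving up :)
    declare variable $max-checks as xs:integer := 10;

    (: placeholder: replace with a real test, e.g. fn:doc-available() on each
       URI this document refers to :)
    declare function local:references-resolved($uri as xs:string) as xs:boolean
    {
      fn:true()
    };

    if (cpf:check-transition($cpf:document-uri, $cpf:transition)) then
      try {
        if (local:references-resolved($cpf:document-uri)) then
          (: ready: move on to the on-success state (perform-update) :)
          cpf:success($cpf:document-uri, $cpf:transition, ())
        else
          let $prop :=
            xdmp:document-get-properties($cpf:document-uri,
              xs:QName("xref-check-count"))[1]
          let $count :=
            if (fn:exists($prop)) then xs:integer($prop) else 0
          return
            if ($count ge $max-checks) then
              (: give up: override the target state to break out of the loop :)
              cpf:success($cpf:document-uri, $cpf:transition,
                xs:anyURI("http://example.com/states/gave-up"))
            else (
              (: remember how many times we have looked :)
              xdmp:document-set-property($cpf:document-uri,
                element xref-check-count { $count + 1 }),
              (: not ready yet: on-failure routes us to check-again :)
              cpf:failure($cpf:document-uri, $cpf:transition, (), ())
            )
      } catch ($e) {
        cpf:failure($cpf:document-uri, $cpf:transition, $e, ())
      }
    else ()

The check-again action can be essentially the same module, and perform-update 
is just the standard action template wrapped around your merge code.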

You have to find the right balance for your application between having many 
small pipeline steps and having each step do too much.  I tend to find it 
easier to start with what makes logical sense and then optimize later, if 
needed.

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mike Sokolov
Sent: Wednesday, December 02, 2009 12:01 PM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] cpf question

I've been working on a cpf pipeline that needs to resolve 
cross-references between a large number of documents.  Basically, we 
want to merge some text from the referenced document into the referring 
document.  I wonder if folks would be able to share good ideas for a 
strategy here.

The challenge with a multi-threaded loading pipeline like cpf is that 
you don't know whether the referenced document is available yet.  There 
are two main ideas we're working with at the moment:

1) a two-pass strategy where you load *all* the documents and then 
resolve all the references.  This is the most straightforward and (in 
theory) requires 2N updates, but I can't see how to trigger all the 
updates in cpf for the second pass: maybe I'm just being thick, but it 
seems to me that there will come a time when you need to go and update 
all the documents (a third time, making it 3N) just to set their state and 
trigger phase 2 processing.  And this seems to defeat the whole purpose 
of cpf, since you then need to build a list of all documents in order to 
retrigger phase 2.

2) a one-pass bidirectional strategy in which each document pulls 
content from its resolvable references and pushes its content into 
documents that reference it.  This is completely order-independent, but 
it results in more updates in a bad case (i.e. lots of cross-references).  
I think that if the average number of references in a given document is 
M, then you get something like N + MN/2  updates if the references are 
distributed evenly.  So if M > 2, this will result in more updates than 
strategy 1: potentially a *lot* more, if M is large.
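(To put rough, purely illustrative numbers on that: with N = 1,000 documents 
averaging M = 4 references each, strategy 1 is about 2N = 2,000 updates, while 
this works out to roughly N + MN/2 = 1,000 + 2,000 = 3,000.)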

As an aside: No matter what M is, if all the xrefs are in the last 
document, this takes only N updates.  If they're all in the first one, 
then you get 2N updates.

I guess the questions are, then:

What's the best way to implement a two-pass processing pipeline in cpf?

Is there some other approach that I haven't thought of?

-Mike
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general