I see; so you are not crawling a repository but instead a sequence of commands, and you don't know what the actual state of the "repository" is until all the commands are processed.
ManifoldCF is not really designed to crawl sequentially-ordered commands. If you can process the commands in sequence first into a "repository" of your own construction, then ManifoldCF would be well-suited to picking documents out of there. I'm trying to think of a good way to do this without actually doing that preprocessing step, but at the moment I'm coming up with nothing useful.

Karl

On Fri, Jun 13, 2014 at 1:14 PM, Matteo Grolla <[email protected]> wrote:

> Hi Karl
> the reason is that if I read the commands in this order from the files
>
> add{doc1}, delete{doc1}, add{doc1}
>
> after the crawl I should find doc1 in Solr
> but if I process them in this order
>
> add{doc1}, add{doc1}, delete{doc1}
>
> there won't be doc1 in Solr after the crawl
>
> The concern about sequential performance is right but my use cases
> typically involve few deletions and lots of adds
>
> suppose I have
>
> add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}
>
> I could process
> add{doc1}, add{doc2}, add{doc3} in parallel
> then delete{doc1}
> then proceed in parallel till the next delete
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 13 Jun 2014, at 19:06, Karl Wright wrote:
>
> > One other point: if the reason that you would be trying to order things is
> > because you'd want to process the XML document before processing its
> > children, you don't need to worry about that at all either, because the
> > framework takes care of that automatically. All you need to do is handle
> > the case where the carrydown data is not present.
> >
> > Thanks,
> > Karl
> >
> > On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <[email protected]> wrote:
> >
> >> Hi Matteo,
> >>
> >> The prerequisite event logic is the only way to order document processing
> >> in ManifoldCF. The javadoc for the event methods is probably the best
> >> reference to use.
> >> I can't say from your description how it would map, but
> >> here's the description in question:
> >>
> >> /** This interface abstracts from the activities that use and govern events.
> >> *
> >> * The purpose of this model is to allow a connector to:
> >> * (a) insure that documents whose prerequisites have not been met do not get processed until those prerequisites are completed
> >> * (b) guarantee that only one thread at a time deal with sequencing of documents
> >> *
> >> * The way it works is as follows. We define the notion of an "event", which is described by a simple string (and thus can be global,
> >> * local to a connection, or local to a job, whichever is appropriate). An event is managed solely by the connector that knows about it.
> >> * Effectively it can be in either of two states: "completed", or "pending". The only time the framework ever changes an event state is when
> >> * the crawler is restarted, at which point all pending events are marked "completed".
> >> *
> >> * Documents, when they are added to the processing queue, specify the set of events on which they will block. If an event is in the "pending" state,
> >> * no documents that block on that event will be processed at that time. Of course, it is possible that a document could be handed to processing just before
> >> * an event entered the "pending" state - in which case it is the responsibility of the connector itself to avoid any problems or conflicts. This can
> >> * usually be handled by proper handling of event signalling. More on that later.
> >> *
> >> * The presumed underlying model of flow inside the connector's processing method is as follows:
> >> * (1) The connector examines the document in question, and decides whether it can be processed successfully or not, based on what it knows about sequencing
> >> * (2) If the connector determines that the document can properly be processed, it does so, and that's it.
> >> * (3) If the connector finds a sequencing-related problem, it:
> >> *     (a) Begins an appropriate event sequence.
> >> *     (b) If the framework indicates that this event is already in the "pending" state, then some other thread is already handling the event, and the connector
> >> *         should abort processing of the current document.
> >> *     (c) If the framework successfully begins the event sequence, then the connector code knows unequivocably that it is the only thread processing the event.
> >> *         It should take whatever action it needs to - which might be requesting special documents, for instance. [Note well: At this time, there is no way
> >> *         to guarantee that special documents added to the queue are in fact properly synchronized by this mechanism, so I recommend avoiding this practice,
> >> *         and instead handling any special document sequences without involving the queue.]
> >> *     (d) If the connector CANNOT successfully take the action it needs to to push the sequence along, it MUST set the event back to the "completed" state.
> >> *         Otherwise, the event will remain in the "pending" state until the next time the crawler is restarted.
> >> *     (e) If the current document cannot yet be processed, its processing should be aborted.
> >> * (4) When the connector determines that the event's conditions have been met, or when it determines that an event sequence is no longer viable and has been
> >> *     aborted, it must set the event status to "completed".
> >> *
> >> * In summary, a connector may perform the following event-related actions:
> >> * (a) Set an event into the "pending" state
> >> * (b) Set an event into the "completed" state
> >> * (c) Add a document to the queue with a specified set of prerequisite events attached
> >> * (d) Request that the current document be requeued for later processing (i.e. abort processing of a document due to sequencing reasons)
> >> *
> >> */
> >> public interface IEventActivity extends INamingActivity
> >> {
> >>   public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245 2010-08-23 18:39:35Z kwright $";
> >>
> >>   /** Begin an event sequence.
> >>   * This method should be called by a connector when a sequencing event should enter the "pending" state. If the event is already in that state,
> >>   * this method will return false, otherwise true. The connector has the responsibility of appropriately managing sequencing given the response
> >>   * status.
> >>   *@param eventName is the event name.
> >>   *@return false if the event is already in the "pending" state.
> >>   */
> >>   public boolean beginEventSequence(String eventName)
> >>     throws ManifoldCFException;
> >>
> >>   /** Complete an event sequence.
> >>   * This method should be called to signal that an event is no longer in the "pending" state. This can mean that the prerequisite processing is
> >>   * completed, but it can also mean that prerequisite processing was aborted or cannot be completed.
> >>   * Note well: This method should not be called unless the connector is CERTAIN that an event is in progress, and that the current thread has
> >>   * the sole right to complete it. Otherwise, race conditions can develop which would be difficult to diagnose.
> >>   *@param eventName is the event name.
> >>   */
> >>   public void completeEventSequence(String eventName)
> >>     throws ManifoldCFException;
> >>
> >>   /** Abort processing a document (for sequencing reasons).
> >>   * This method should be called in order to cause the specified document to be requeued for later processing. While this is similar in some respects
> >>   * to the semantics of a ServiceInterruption, it is applicable to only one document at a time, and also does not specify any delay period, since it is
> >>   * presumed that the reason for the requeue is because of sequencing issues synchronized around an underlying event.
> >>   *@param localIdentifier is the document identifier to requeue
> >>   */
> >>   public void retryDocumentProcessing(String localIdentifier)
> >>     throws ManifoldCFException;
> >>
> >> }
> >>
> >> As you can see, these constraints are significant and can cause
> >> single-threaded behavior, so unless you've got a real requirement for
> >> ordering, it's better not to do it.
> >>
> >> Furthermore, the question of deletions is really not germane, because
> >> ManifoldCF does not in fact order deletions at all. They are done either
> >> as a side-effect of document processing (when a document is discovered to
> >> not be there anymore), or at the end of a job (when orphaned documents are
> >> removed). They are also deleted when the job that owns them is deleted.
> >>
> >> Karl
> >>
> >> On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla <[email protected]> wrote:
> >>
> >>> Hi
> >>> I'm going to develop a manifold connector and one requirement
> >>> is that it should be able to handle document insertion and deletion in
> >>> order (details coming).
> >>> Actually I've already built such a crawler as a standalone application and
> >>> the design was conceptually this
> >>>
> >>> instead of a Document Queue I have a CommandQueue
> >>> commands can be delete (specifying the docId) or add (specifying
> >>> the doc to be added)
> >>> when a worker thread takes a delete no other worker is allowed to pick
> >>> other commands from the queue until the delete has been committed
> >>>
> >>> Ex.
> >>> suppose I have the following chunk of CommandQueue:
> >>>
> >>> add{doc1}, delete{doc1}, add{doc1}
> >>>
> >>> I need to avoid the situation where commands are processed in this order:
> >>> add{doc1}, add{doc1}, delete{doc1}
> >>>
> >>> I think the EventSequence could help me implement this synchronization in
> >>> Manifold
> >>> when seeding the identifiers I could embed the command in the identifier
> >>> Ex.
> >>> instead of stuffing the identifier "hd-samsung-500GB"
> >>> I could stuff "add hd-samsung-500GB"
> >>>
> >>> The question is: Am I running into huge troubles trying to implement this
> >>> requirement or not?
> >>>
> >>> --
> >>> Matteo Grolla
> >>> Sourcesense - making sense of Open Source
> >>> http://www.sourcesense.com
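
For reference, the preprocessing step suggested at the top of this thread (replaying the ordered commands into a "repository" of your own before crawling) might look roughly like the sketch below. It is only an illustration under stated assumptions: `CommandFolder`, `Command`, and `fold` are hypothetical names, not part of ManifoldCF, and the in-memory map stands in for whatever store the connector would actually crawl.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT a ManifoldCF class: replays an ordered command
// log into the final state of a "repository" that a connector could crawl.
class CommandFolder {

  // One parsed command (hypothetical type): an add carries content,
  // a delete only needs the document id.
  static final class Command {
    final boolean isAdd;
    final String docId;
    final String content;

    private Command(boolean isAdd, String docId, String content) {
      this.isAdd = isAdd;
      this.docId = docId;
      this.content = content;
    }

    static Command add(String docId, String content) {
      return new Command(true, docId, content);
    }

    static Command delete(String docId) {
      return new Command(false, docId, null);
    }
  }

  // Apply the commands strictly in file order; the last command seen for
  // a given id decides whether that document exists in the result.
  static Map<String, String> fold(List<Command> commands) {
    Map<String, String> repository = new LinkedHashMap<>();
    for (Command c : commands) {
      if (c.isAdd) {
        repository.put(c.docId, c.content);
      } else {
        repository.remove(c.docId);
      }
    }
    return repository;
  }

  public static void main(String[] args) {
    // add{doc1}, delete{doc1}, add{doc1}, plus one unrelated add.
    List<Command> log = new ArrayList<>();
    log.add(Command.add("doc1", "v1"));
    log.add(Command.delete("doc1"));
    log.add(Command.add("doc1", "v2"));
    log.add(Command.add("doc2", "v1"));

    // Prints the ids that survive the replay: [doc1, doc2]
    System.out.println(fold(log).keySet());
  }
}
```

Because the replay is single-threaded and happens before the crawl, the order in which the crawler later picks up the surviving documents no longer matters, so they can be indexed in parallel.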
