You perfectly described the situation. If I could get sets of XML files, where each set represents a snapshot of the source system state, then my crawler would fit the ManifoldCF design much better. I'll see if it's possible. Concurrency could certainly be better exploited this way.
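To make the snapshot idea concrete, here is a rough sketch of the preprocessing step: replay the commands strictly in order into an in-memory map, so that what is left is the final state a crawler could pick documents from. The class, method, and field names are hypothetical and not part of ManifoldCF, and I'm assuming each add carries the full document body.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from ManifoldCF. */
class CommandLogSnapshot {

    /** One entry of the command log: an add carries a body, a delete does not. */
    static final class Command {
        final boolean isDelete;
        final String docId;
        final String body; // null for deletes

        private Command(boolean isDelete, String docId, String body) {
            this.isDelete = isDelete;
            this.docId = docId;
            this.body = body;
        }

        static Command add(String docId, String body) {
            return new Command(false, docId, body);
        }

        static Command delete(String docId) {
            return new Command(true, docId, null);
        }
    }

    /** Replays the commands strictly in order and returns the surviving documents. */
    static Map<String, String> replay(List<Command> log) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (Command c : log) {
            if (c.isDelete) {
                snapshot.remove(c.docId);
            } else {
                snapshot.put(c.docId, c.body); // a later add overwrites an earlier one
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        List<Command> log = Arrays.asList(
            Command.add("doc1", "v1"),
            Command.delete("doc1"),
            Command.add("doc1", "v2"));
        // Prints the ids and bodies that remain after the whole log is applied.
        System.out.println(replay(log));
    }
}
```

Once the log is collapsed like this, the order in which the surviving documents are indexed no longer matters, which is why the crawl itself could then run fully in parallel.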
--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 19:21, Karl Wright wrote:

> I see; so you are not crawling a repository but instead a sequence of
> commands, and you don't know what the actual state of the "repository" is
> until all the commands are processed.
>
> ManifoldCF is not really designed to crawl sequentially-ordered commands.
> If you can process the commands in sequence first into a "repository" of
> your own construction, then ManifoldCF would be well-suited to picking
> documents out of there. I'm trying to think of a good way to do this
> without actually doing that preprocessing step, but at the moment I'm
> coming up with nothing useful.
>
> Karl
>
> On Fri, Jun 13, 2014 at 1:14 PM, Matteo Grolla <[email protected]>
> wrote:
>
>> Hi Karl
>> the reason is that if I read the commands in this order from the
>> files
>>
>> add{doc1}, delete{doc1}, add{doc1}
>>
>> after the crawl I should find doc1 in Solr
>> but if I process them in this order
>>
>> add{doc1}, add{doc1}, delete{doc1}
>>
>> there won't be doc1 in Solr after the crawl
>>
>> The concern about sequential performance is right but my use cases
>> typically involve few deletions and lots of adds
>>
>> suppose I have
>>
>> add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}
>>
>> I could process
>> add{doc1}, add{doc2}, add{doc3} in parallel
>> then delete{doc1}
>> then proceed in parallel till the next delete
>>
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>>
>> On 13 Jun 2014, at 19:06, Karl Wright wrote:
>>
>>> One other point: if the reason that you would be trying to order things is
>>> because you'd want to process the xml document before processing its
>>> children, you don't need to worry about that at all either, because the
>>> framework takes care of that automatically.
>>> All you need to do is handle
>>> the case where the carrydown data is not present.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Matteo,
>>>>
>>>> The prerequisite event logic is the only way to order document processing
>>>> in ManifoldCF. The javadoc for the event methods is probably the best
>>>> reference to use. I can't say from your description how it would map, but
>>>> here's the description in question:
>>>>
>>>> /** This interface abstracts from the activities that use and govern events.
>>>> *
>>>> * The purpose of this model is to allow a connector to:
>>>> * (a) insure that documents whose prerequisites have not been met do not
>>>> *     get processed until those prerequisites are completed
>>>> * (b) guarantee that only one thread at a time deal with sequencing of documents
>>>> *
>>>> * The way it works is as follows. We define the notion of an "event",
>>>> * which is described by a simple string (and thus can be global,
>>>> * local to a connection, or local to a job, whichever is appropriate). An
>>>> * event is managed solely by the connector that knows about it.
>>>> * Effectively it can be in either of two states: "completed", or
>>>> * "pending". The only time the framework ever changes an event state is when
>>>> * the crawler is restarted, at which point all pending events are marked
>>>> * "completed".
>>>> *
>>>> * Documents, when they are added to the processing queue, specify the set
>>>> * of events on which they will block. If an event is in the "pending" state,
>>>> * no documents that block on that event will be processed at that time.
>>>> * Of course, it is possible that a document could be handed to processing
>>>> * just before an event entered the "pending" state - in which case it is the
>>>> * responsibility of the connector itself to avoid any problems or conflicts.
>>>> * This can usually be handled by proper handling of event signalling.
>>>> * More on that later.
>>>> *
>>>> * The presumed underlying model of flow inside the connector's processing
>>>> * method is as follows:
>>>> * (1) The connector examines the document in question, and decides whether
>>>> *     it can be processed successfully or not, based on what it knows about
>>>> *     sequencing
>>>> * (2) If the connector determines that the document can properly be
>>>> *     processed, it does so, and that's it.
>>>> * (3) If the connector finds a sequencing-related problem, it:
>>>> *     (a) Begins an appropriate event sequence.
>>>> *     (b) If the framework indicates that this event is already in the
>>>> *         "pending" state, then some other thread is already handling the
>>>> *         event, and the connector should abort processing of the current
>>>> *         document.
>>>> *     (c) If the framework successfully begins the event sequence, then
>>>> *         the connector code knows unequivocably that it is the only thread
>>>> *         processing the event. It should take whatever action it needs to -
>>>> *         which might be requesting special documents, for instance.
>>>> *         [Note well: At this time, there is no way to guarantee that
>>>> *         special documents added to the queue are in fact properly
>>>> *         synchronized by this mechanism, so I recommend avoiding this
>>>> *         practice, and instead handling any special document sequences
>>>> *         without involving the queue.]
>>>> *     (d) If the connector CANNOT successfully take the action it needs to
>>>> *         to push the sequence along, it MUST set the event back to the
>>>> *         "completed" state. Otherwise, the event will remain in the
>>>> *         "pending" state until the next time the crawler is restarted.
>>>> *     (e) If the current document cannot yet be processed, its processing
>>>> *         should be aborted.
>>>> * (4) When the connector determines that the event's conditions have been
>>>> *     met, or when it determines that an event sequence is no longer viable
>>>> *     and has been aborted, it must set the event status to "completed".
>>>> *
>>>> * In summary, a connector may perform the following event-related actions:
>>>> * (a) Set an event into the "pending" state
>>>> * (b) Set an event into the "completed" state
>>>> * (c) Add a document to the queue with a specified set of prerequisite
>>>> *     events attached
>>>> * (d) Request that the current document be requeued for later processing
>>>> *     (i.e. abort processing of a document due to sequencing reasons)
>>>> *
>>>> */
>>>> public interface IEventActivity extends INamingActivity
>>>> {
>>>>   public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245 2010-08-23 18:39:35Z kwright $";
>>>>
>>>>   /** Begin an event sequence.
>>>>   * This method should be called by a connector when a sequencing event
>>>>   * should enter the "pending" state. If the event is already in that state,
>>>>   * this method will return false, otherwise true. The connector has the
>>>>   * responsibility of appropriately managing sequencing given the response
>>>>   * status.
>>>>   *@param eventName is the event name.
>>>>   *@return false if the event is already in the "pending" state.
>>>>   */
>>>>   public boolean beginEventSequence(String eventName)
>>>>     throws ManifoldCFException;
>>>>
>>>>   /** Complete an event sequence.
>>>>   * This method should be called to signal that an event is no longer in
>>>>   * the "pending" state. This can mean that the prerequisite processing is
>>>>   * completed, but it can also mean that prerequisite processing was
>>>>   * aborted or cannot be completed.
>>>>   * Note well: This method should not be called unless the connector is
>>>>   * CERTAIN that an event is in progress, and that the current thread has
>>>>   * the sole right to complete it.
>>>>   * Otherwise, race conditions can develop
>>>>   * which would be difficult to diagnose.
>>>>   *@param eventName is the event name.
>>>>   */
>>>>   public void completeEventSequence(String eventName)
>>>>     throws ManifoldCFException;
>>>>
>>>>   /** Abort processing a document (for sequencing reasons).
>>>>   * This method should be called in order to cause the specified document
>>>>   * to be requeued for later processing. While this is similar in some respects
>>>>   * to the semantics of a ServiceInterruption, it is applicable to only
>>>>   * one document at a time, and also does not specify any delay period, since
>>>>   * it is presumed that the reason for the requeue is because of sequencing
>>>>   * issues synchronized around an underlying event.
>>>>   *@param localIdentifier is the document identifier to requeue
>>>>   */
>>>>   public void retryDocumentProcessing(String localIdentifier)
>>>>     throws ManifoldCFException;
>>>>
>>>> }
>>>>
>>>> As you can see, these constraints are significant and can cause
>>>> single-threaded behavior, so unless you've got a real requirement for
>>>> ordering, it's better not to do it.
>>>>
>>>> Furthermore, the question of deletions is really not germane, because
>>>> ManifoldCF does not in fact order deletions at all. They are done either
>>>> as a side-effect of document processing (when a document is discovered to
>>>> not be there anymore), or at the end of a job (when orphaned documents are
>>>> removed). They are also deleted when the job that owns them is deleted.
>>>>
>>>> Karl
>>>>
>>>> On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi
>>>>> I'm going to develop a ManifoldCF connector and one requirement
>>>>> is that it should be able to handle document insertion and deletion in
>>>>> order (details coming).
>>>>> Actually I've already built such a crawler as a standalone application,
>>>>> and the design was conceptually this:
>>>>>
>>>>> instead of a Document Queue I have a CommandQueue
>>>>> commands can be delete (specifying the docId) or add (specifying
>>>>> the doc to be added)
>>>>> when a worker thread takes a delete no other worker is allowed to pick
>>>>> other commands from the queue until the delete has been committed
>>>>>
>>>>> Ex. suppose I have the following chunk of CommandQueue:
>>>>>
>>>>> add{doc1}, delete{doc1}, add{doc1}
>>>>>
>>>>> I need to avoid the situation where commands are processed in this order:
>>>>> add{doc1}, add{doc1}, delete{doc1}
>>>>>
>>>>> I think the EventSequence could help me implement this synchronization in
>>>>> Manifold
>>>>> when seeding the identifiers I could embed in the identifier the command
>>>>> Ex.
>>>>> instead of stuffing the identifier "hd-samsung-500GB"
>>>>> I could stuff "add hd-samsung-500GB"
>>>>>
>>>>> The question is: Am I running into huge troubles trying to implement this
>>>>> requirement or not?
>>>>>
>>>>> --
>>>>> Matteo Grolla
>>>>> Sourcesense - making sense of Open Source
>>>>> http://www.sourcesense.com
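For reference, the delete-as-barrier idea from my original message can be sketched roughly as follows. This is only an illustration of the batching, not ManifoldCF code: the class and method names are hypothetical, commands are assumed to be plain strings of the form "add <id>" or "delete <id>", and adds inside one batch are assumed to be safe to run concurrently (distinct ids, or idempotent adds).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from ManifoldCF. */
class DeleteBarrierBatcher {

    /**
     * Splits an ordered command list into batches. Consecutive adds share a
     * batch; every delete gets a batch of its own, so it acts as a barrier
     * between the adds before it and the adds after it.
     */
    static List<List<String>> batches(List<String> commands) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String command : commands) {
            if (command.startsWith("delete ")) {
                if (!current.isEmpty()) {
                    result.add(current);       // close the pending batch of adds
                    current = new ArrayList<>();
                }
                result.add(new ArrayList<>(Arrays.asList(command)));
            } else {
                current.add(command);
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> queue = Arrays.asList(
            "add doc1", "add doc2", "add doc3", "delete doc1", "add doc1");
        // Batches must be executed one after another; the commands inside a
        // batch are the ones that could be handed to worker threads together.
        for (List<String> batch : batches(queue)) {
            System.out.println(batch);
        }
    }
}
```

With few deletes and many adds, most of the work ends up in large batches, which is the case where this design keeps the workers busy.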
