Re: processing document addition and delete in order

Matteo Grolla Fri, 13 Jun 2014 10:16:02 -0700

Hi Karl,
        the reason is that if I read the commands from the files in this order:


        add{doc1}, delete{doc1}, add{doc1}

        then after the crawl I should find doc1 in Solr,
        but if they are processed in this order:

        add{doc1}, add{doc1}, delete{doc1}

        there won't be a doc1 in Solr after the crawl.
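
To make the difference concrete, here is a minimal sketch in Java that replays both orderings against an in-memory set standing in for the Solr index. The Command class and the add, delete and apply helpers are hypothetical names used only for illustration; none of this is ManifoldCF or Solr API.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CommandOrderDemo {
    /** Hypothetical command: an operation ("add" or "delete") on one document id. */
    static final class Command {
        final String op;
        final String docId;
        Command(String op, String docId) { this.op = op; this.docId = docId; }
    }

    static Command add(String docId) { return new Command("add", docId); }
    static Command delete(String docId) { return new Command("delete", docId); }

    /** Applies the commands strictly in list order; the set stands in for the index. */
    static Set<String> apply(List<Command> commands) {
        Set<String> index = new LinkedHashSet<>();
        for (Command c : commands) {
            if ("add".equals(c.op)) {
                index.add(c.docId);
            } else {
                index.remove(c.docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Set<String> intended = apply(Arrays.asList(add("doc1"), delete("doc1"), add("doc1")));
        Set<String> reordered = apply(Arrays.asList(add("doc1"), add("doc1"), delete("doc1")));
        System.out.println("intended order keeps doc1: " + intended.contains("doc1"));
        System.out.println("reordered run keeps doc1: " + reordered.contains("doc1"));
    }
}
```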

The concern about sequential performance is valid, but my use cases typically 
involve few deletions and many adds.

        Suppose I have

        add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}


        I could process 
        add{doc1}, add{doc2}, add{doc3} in parallel,
        then delete{doc1} on its own,
        then proceed in parallel until the next delete.
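
The scheme above can be sketched in Java roughly as follows. This is only an illustration of the standalone design, not ManifoldCF code: DeleteBarrierPlanner, plan and run are hypothetical names, commands are plain strings of the form "add <id>" or "delete <id>", and a concurrent set stands in for the index. Each delete becomes its own stage, and a stage starts only after the previous one has finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class DeleteBarrierPlanner {
    /** Splits the queue into stages: runs of adds, with every delete alone in its own stage. */
    static List<List<String>> plan(List<String> commands) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String cmd : commands) {
            if (cmd.startsWith("delete ")) {
                if (!current.isEmpty()) {
                    stages.add(current);
                    current = new ArrayList<>();
                }
                stages.add(Collections.singletonList(cmd));
            } else {
                current.add(cmd);
            }
        }
        if (!current.isEmpty()) {
            stages.add(current);
        }
        return stages;
    }

    /** Runs the stages one after another; commands inside a stage run on the thread pool. */
    static void run(List<List<String>> stages, Consumer<String> handler, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (List<String> stage : stages) {
                List<Callable<Void>> tasks = new ArrayList<>();
                for (String cmd : stage) {
                    tasks.add(() -> { handler.accept(cmd); return null; });
                }
                // invokeAll returns only after every task in this stage has finished.
                for (Future<Void> f : pool.invokeAll(tasks)) {
                    f.get(); // surfaces any exception thrown by the handler
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running stages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("command failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> queue = Arrays.asList(
            "add doc1", "add doc2", "add doc3", "delete doc1", "add doc1");
        Set<String> index = ConcurrentHashMap.newKeySet(); // stand-in for the index
        run(plan(queue), cmd -> {
            if (cmd.startsWith("add ")) {
                index.add(cmd.substring("add ".length()));
            } else {
                index.remove(cmd.substring("delete ".length()));
            }
        }, 4);
        System.out.println("doc1 present after run: " + index.contains("doc1"));
    }
}
```

Note that this sketch assumes the adds inside one stage are independent of each other. Two adds of the same document with different content in the same stage could still race, so that case would need its own ordering rule.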


-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 19:06, Karl Wright wrote:

> One other point: if the reason that you would be trying to order things is
> because you'd want to process the xml document before processing its
> children, you don't need to worry about that at all either, because the
> framework takes care of that automatically.  All you need to do is handle
> the case where the carrydown data is not present.
> 
> Thanks,
> Karl
> 
> 
> 
> On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <[email protected]> wrote:
> 
>> Hi Matteo,
>> 
>> The prerequisite event logic is the only way to order document processing
>> in ManifoldCF.  The javadoc for the event methods is probably the best
>> reference to use.  I can't say from your description how it would map, but
>> here's the description in question:
>> 
>> /** This interface abstracts from the activities that use and govern
>> events.
>> *
>> * The purpose of this model is to allow a connector to:
>> * (a) ensure that documents whose prerequisites have not been met do not
>> get processed until those prerequisites are completed
>> * (b) guarantee that only one thread at a time deals with sequencing of
>> documents
>> *
>> * The way it works is as follows.  We define the notion of an "event",
>> which is described by a simple string (and thus can be global,
>> * local to a connection, or local to a job, whichever is appropriate).  An
>> event is managed solely by the connector that knows about it.
>> * Effectively it can be in either of two states: "completed", or
>> "pending".  The only time the framework ever changes an event state is when
>> * the crawler is restarted, at which point all pending events are marked
>> "completed".
>> *
>> * Documents, when they are added to the processing queue, specify the set
>> of events on which they will block.  If an event is in the "pending" state,
>> * no documents that block on that event will be processed at that time.
>> Of course, it is possible that a document could be handed to processing
>> just before
>> * an event entered the "pending" state - in which case it is the
>> responsibility of the connector itself to avoid any problems or conflicts.
>> This can
>> * usually be handled by proper handling of event signalling.  More on that
>> later.
>> *
>> * The presumed underlying model of flow inside the connector's processing
>> method is as follows:
>> * (1) The connector examines the document in question, and decides whether
>> it can be processed successfully or not, based on what it knows about
>> sequencing
>> * (2) If the connector determines that the document can properly be
>> processed, it does so, and that's it.
>> * (3) If the connector finds a sequencing-related problem, it:
>> *     (a) Begins an appropriate event sequence.
>> *     (b) If the framework indicates that this event is already in the
>> "pending" state, then some other thread is already handling the event, and
>> the connector
>> *          should abort processing of the current document.
>> *     (c) If the framework successfully begins the event sequence, then
>> the connector code knows unequivocally that it is the only thread
>> processing the event.
>> *         It should take whatever action it needs to - which might be
>> requesting special documents, for instance.  [Note well: At this time,
>> there is no way
>> *         to guarantee that special documents added to the queue are in
>> fact properly synchronized by this mechanism, so I recommend avoiding this
>> practice,
>> *         and instead handling any special document sequences without
>> involving the queue.]
>> *     (d) If the connector CANNOT successfully take the action it needs
>> to push the sequence along, it MUST set the event back to the "completed"
>> state.
>> *         Otherwise, the event will remain in the "pending" state until
>> the next time the crawler is restarted.
>> *     (e) If the current document cannot yet be processed, its processing
>> should be aborted.
>> * (4) When the connector determines that the event's conditions have been
>> met, or when it determines that an event sequence is no longer viable and
>> has been
>> *     aborted, it must set the event status to "completed".
>> *
>> * In summary, a connector may perform the following event-related actions:
>> * (a) Set an event into the "pending" state
>> * (b) Set an event into the "completed" state
>> * (c) Add a document to the queue with a specified set of prerequisite
>> events attached
>> * (d) Request that the current document be requeued for later processing
>> (i.e. abort processing of a document due to sequencing reasons)
>> *
>> */
>> public interface IEventActivity extends INamingActivity
>> {
>>  public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245 2010-08-23 18:39:35Z kwright $";
>> 
>>  /** Begin an event sequence.
>>  * This method should be called by a connector when a sequencing event
>> should enter the "pending" state.  If the event is already in that state,
>>  * this method will return false, otherwise true.  The connector has the
>> responsibility of appropriately managing sequencing given the response
>>  * status.
>>  *@param eventName is the event name.
>>  *@return false if the event is already in the "pending" state.
>>  */
>>  public boolean beginEventSequence(String eventName)
>>    throws ManifoldCFException;
>> 
>>  /** Complete an event sequence.
>>  * This method should be called to signal that an event is no longer in
>> the "pending" state.  This can mean that the prerequisite processing is
>>  * completed, but it can also mean that prerequisite processing was
>> aborted or cannot be completed.
>>  * Note well: This method should not be called unless the connector is
>> CERTAIN that an event is in progress, and that the current thread has
>>  * the sole right to complete it.  Otherwise, race conditions can develop
>> which would be difficult to diagnose.
>>  *@param eventName is the event name.
>>  */
>>  public void completeEventSequence(String eventName)
>>    throws ManifoldCFException;
>> 
>>  /** Abort processing a document (for sequencing reasons).
>>  * This method should be called in order to cause the specified document
>> to be requeued for later processing.  While this is similar in some respects
>>  * to the semantics of a ServiceInterruption, it is applicable to only
>> one document at a time, and also does not specify any delay period, since
>> it is
>>  * presumed that the reason for the requeue is because of sequencing
>> issues synchronized around an underlying event.
>>  *@param localIdentifier is the document identifier to requeue
>>  */
>>  public void retryDocumentProcessing(String localIdentifier)
>>    throws ManifoldCFException;
>> 
>> 
>> }
>> 
>> 
>> As you can see, these constraints are significant and can cause
>> single-threaded behavior, so unless you've got a real requirement for
>> ordering, it's better not to do it.
>> 
>> Furthermore, the question of deletions is really not germane, because
>> ManifoldCF does not in fact order deletions at all.  They are done either
>> as a side-effect of document processing (when a document is discovered to
>> not be there anymore), or at the end of a job (when orphaned documents are
>> removed).  They are also deleted when the job that owns them is deleted.
>> 
>> Karl
>> 
>> 
>> 
>> On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla <[email protected]>
>> wrote:
>> 
>>> Hi
>>>        I'm going to develop a ManifoldCF connector, and one requirement
>>> is that it should be able to handle document insertion and deletion in
>>> order (details coming).
>>> Actually I've already built such a crawler as a standalone application, and
>>> the design was conceptually this:
>>> 
>>> instead of a Document Queue I have a CommandQueue
>>>        commands can be delete (specifying the docId) or add (specifying
>>> the doc to be added)
>>> when a worker thread takes a delete no other worker is allowed to pick
>>> other commands from the queue until the delete has been committed
>>> 
>>> 
>>> Ex. suppose I have the following chunk of CommandQueue:
>>> 
>>> add{doc1}, delete{doc1}, add{doc1}
>>> 
>>> I need to avoid the situation where commands are processed in this order:
>>> add{doc1}, add{doc1}, delete{doc1}
>>> 
>>> 
>>> I think the EventSequence could help me implement this synchronization in
>>> Manifold
>>> when seeding the identifiers I could embed in the identifier the command
>>> Ex.
>>>        instead of stuffing the identifier "hd-samsung-500GB"
>>>        I could stuff "add hd-samsung-500GB"
>>> 
>>> The question is: am I heading for serious trouble by trying to implement
>>> this requirement, or not?
>>> 
>>> --
>>> Matteo Grolla
>>> Sourcesense - making sense of Open Source
>>> http://www.sourcesense.com
>>> 
>>> 
>>
