Hi Karl,
the reason is that if I read the commands from the files in this order:
add{doc1}, delete{doc1}, add{doc1}
then after the crawl I should find doc1 in Solr.
But if they are processed in this order:
add{doc1}, add{doc1}, delete{doc1}
there won't be any doc1 in Solr after the crawl.
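
A minimal sketch of why the order matters, assuming the index behaves like a
plain set of document ids and each command is a string such as "add doc1".
The class name CommandOrderDemo and the command format are hypothetical; this
is not ManifoldCF or Solr code.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the index: just a set of document ids.
class CommandOrderDemo {

  // Applies commands such as "add doc1" or "delete doc1" strictly in list
  // order and returns the ids left in the simulated index.
  static Set<String> apply(List<String> commands) {
    Set<String> index = new LinkedHashSet<>();
    for (String command : commands) {
      String[] parts = command.split(" ", 2);
      if (parts[0].equals("add")) {
        index.add(parts[1]);
      } else if (parts[0].equals("delete")) {
        index.remove(parts[1]);
      } else {
        throw new IllegalArgumentException("Unknown command: " + command);
      }
    }
    return index;
  }

  public static void main(String[] args) {
    // Intended order: doc1 survives. Prints [doc1]
    System.out.println(apply(Arrays.asList("add doc1", "delete doc1", "add doc1")));
    // Reordered: the delete lands last. Prints []
    System.out.println(apply(Arrays.asList("add doc1", "add doc1", "delete doc1")));
  }
}
```

The same three commands give a different final state depending only on where
the delete falls.
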
The concern about sequential performance is valid, but my use cases typically
involve few deletions and lots of adds.
Suppose I have:
add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}
I could process
add{doc1}, add{doc2}, add{doc3} in parallel,
then delete{doc1} on its own,
then proceed in parallel until the next delete.
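
The batching idea above could be sketched roughly as follows. This is only an
illustration under assumptions: commands are plain strings like "add doc1" or
"delete doc1"; DeleteBarrierBatcher, split and run are hypothetical names; and
the handler stands in for whatever actually sends a command to Solr. It does
not use any ManifoldCF API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class DeleteBarrierBatcher {

  // Groups consecutive adds into one batch; every delete becomes a batch of
  // its own, so it acts as a barrier between the adds before and after it.
  static List<List<String>> split(List<String> commands) {
    List<List<String>> batches = new ArrayList<>();
    List<String> adds = new ArrayList<>();
    for (String command : commands) {
      if (command.startsWith("delete ")) {
        if (!adds.isEmpty()) {
          batches.add(adds);
          adds = new ArrayList<>();
        }
        batches.add(Collections.singletonList(command));
      } else {
        adds.add(command);
      }
    }
    if (!adds.isEmpty()) {
      batches.add(adds);
    }
    return batches;
  }

  // Runs each batch on a thread pool and waits for the whole batch to finish
  // before starting the next one.
  static void run(List<String> commands, int threads, Consumer<String> handler) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      for (List<String> batch : split(commands)) {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (String command : batch) {
          tasks.add(() -> {
            handler.accept(command);
            return null;
          });
        }
        // invokeAll blocks until every task in the batch is done;
        // get() rethrows any failure from the handler.
        for (Future<Void> done : pool.invokeAll(tasks)) {
          done.get();
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while processing commands", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Command failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

For the sequence above, split returns three batches: the three adds, the
delete by itself, and the final add. The delete handler would need to commit
before returning for the barrier to mean anything. Two adds of the same
document inside one batch can still race with each other, so this only orders
adds relative to deletes.
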
--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com
On 13 Jun 2014, at 19:06, Karl Wright wrote:
> One other point: if the reason that you would be trying to order things is
> because you'd want to process the xml document before processing its
> children, you don't need to worry about that at all either, because the
> framework takes care of that automatically. All you need to do is handle
> the case where the carrydown data is not present.
>
> Thanks,
> Karl
>
>
>
> On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Matteo,
>>
>> The prerequisite event logic is the only way to order document processing
>> in ManifoldCF. The javadoc for the event methods is probably the best
>> reference to use. I can't say from your description how it would map, but
>> here's the description in question:
>>
>> /** This interface abstracts from the activities that use and govern events.
>> *
>> * The purpose of this model is to allow a connector to:
>> * (a) insure that documents whose prerequisites have not been met do not get
>> *     processed until those prerequisites are completed
>> * (b) guarantee that only one thread at a time deal with sequencing of
>> *     documents
>> *
>> * The way it works is as follows. We define the notion of an "event", which
>> * is described by a simple string (and thus can be global, local to a
>> * connection, or local to a job, whichever is appropriate). An event is
>> * managed solely by the connector that knows about it. Effectively it can be
>> * in either of two states: "completed", or "pending". The only time the
>> * framework ever changes an event state is when the crawler is restarted, at
>> * which point all pending events are marked "completed".
>> *
>> * Documents, when they are added to the processing queue, specify the set of
>> * events on which they will block. If an event is in the "pending" state, no
>> * documents that block on that event will be processed at that time. Of
>> * course, it is possible that a document could be handed to processing just
>> * before an event entered the "pending" state - in which case it is the
>> * responsibility of the connector itself to avoid any problems or conflicts.
>> * This can usually be handled by proper handling of event signalling. More
>> * on that later.
>> *
>> * The presumed underlying model of flow inside the connector's processing
>> * method is as follows:
>> * (1) The connector examines the document in question, and decides whether it
>> *     can be processed successfully or not, based on what it knows about
>> *     sequencing
>> * (2) If the connector determines that the document can properly be
>> *     processed, it does so, and that's it.
>> * (3) If the connector finds a sequencing-related problem, it:
>> *     (a) Begins an appropriate event sequence.
>> *     (b) If the framework indicates that this event is already in the
>> *         "pending" state, then some other thread is already handling the
>> *         event, and the connector should abort processing of the current
>> *         document.
>> *     (c) If the framework successfully begins the event sequence, then the
>> *         connector code knows unequivocably that it is the only thread
>> *         processing the event. It should take whatever action it needs to -
>> *         which might be requesting special documents, for instance. [Note
>> *         well: At this time, there is no way to guarantee that special
>> *         documents added to the queue are in fact properly synchronized by
>> *         this mechanism, so I recommend avoiding this practice, and instead
>> *         handling any special document sequences without involving the
>> *         queue.]
>> *     (d) If the connector CANNOT successfully take the action it needs to
>> *         to push the sequence along, it MUST set the event back to the
>> *         "completed" state. Otherwise, the event will remain in the
>> *         "pending" state until the next time the crawler is restarted.
>> *     (e) If the current document cannot yet be processed, its processing
>> *         should be aborted.
>> * (4) When the connector determines that the event's conditions have been
>> *     met, or when it determines that an event sequence is no longer viable
>> *     and has been aborted, it must set the event status to "completed".
>> *
>> * In summary, a connector may perform the following event-related actions:
>> * (a) Set an event into the "pending" state
>> * (b) Set an event into the "completed" state
>> * (c) Add a document to the queue with a specified set of prerequisite events
>> *     attached
>> * (d) Request that the current document be requeued for later processing
>> *     (i.e. abort processing of a document due to sequencing reasons)
>> *
>> */
>> public interface IEventActivity extends INamingActivity
>> {
>> public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245 2010-08-23 18:39:35Z kwright $";
>>
>> /** Begin an event sequence.
>> * This method should be called by a connector when a sequencing event should
>> * enter the "pending" state. If the event is already in that state, this
>> * method will return false, otherwise true. The connector has the
>> * responsibility of appropriately managing sequencing given the response
>> * status.
>> *@param eventName is the event name.
>> *@return false if the event is already in the "pending" state.
>> */
>> public boolean beginEventSequence(String eventName)
>> throws ManifoldCFException;
>>
>> /** Complete an event sequence.
>> * This method should be called to signal that an event is no longer in the
>> * "pending" state. This can mean that the prerequisite processing is
>> * completed, but it can also mean that prerequisite processing was aborted
>> * or cannot be completed.
>> * Note well: This method should not be called unless the connector is
>> * CERTAIN that an event is in progress, and that the current thread has the
>> * sole right to complete it. Otherwise, race conditions can develop which
>> * would be difficult to diagnose.
>> *@param eventName is the event name.
>> */
>> public void completeEventSequence(String eventName)
>> throws ManifoldCFException;
>>
>> /** Abort processing a document (for sequencing reasons).
>> * This method should be called in order to cause the specified document to
>> * be requeued for later processing. While this is similar in some respects
>> * to the semantics of a ServiceInterruption, it is applicable to only one
>> * document at a time, and also does not specify any delay period, since it
>> * is presumed that the reason for the requeue is because of sequencing
>> * issues synchronized around an underlying event.
>> *@param localIdentifier is the document identifier to requeue
>> */
>> public void retryDocumentProcessing(String localIdentifier)
>> throws ManifoldCFException;
>>
>>
>> }
>>
>>
>> As you can see, these constraints are significant and can cause
>> single-threaded behavior, so unless you've got a real requirement for
>> ordering, it's better not to do it.
>>
>> Furthermore, the question of deletions is really not germane, because
>> ManifoldCF does not in fact order deletions at all. They are done either
>> as a side-effect of document processing (when a document is discovered to
>> not be there anymore), or at the end of a job (when orphaned documents are
>> removed). They are also deleted when the job that owns them is deleted.
>>
>> Karl
>>
>>
>>
>> On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla <[email protected]>
>> wrote:
>>
>>> Hi
>>> I'm going to develop a ManifoldCF connector, and one requirement
>>> is that it should be able to handle document insertion and deletion in
>>> order (details coming).
>>> Actually I've already built such a crawler as a standalone application, and
>>> the design was conceptually this:
>>>
>>> instead of a Document Queue I have a CommandQueue
>>> commands can be delete (specifying the docId) or add (specifying
>>> the doc to be added)
>>> when a worker thread takes a delete no other worker is allowed to pick
>>> other commands from the queue until the delete has been committed
>>>
>>>
>>> Ex. suppose I have the following chunk of CommandQueue:
>>>
>>> add{doc1}, delete{doc1}, add{doc1}
>>>
>>> I need to avoid the situation where commands are processed in this order:
>>> add{doc1}, add{doc1}, delete{doc1}
>>>
>>>
>>> I think the EventSequence could help me implement this synchronization in
>>> ManifoldCF:
>>> when seeding the identifiers, I could embed the command in the identifier.
>>> Ex.
>>> instead of stuffing the identifier "hd-samsung-500GB"
>>> I could stuff "add hd-samsung-500GB"
>>>
>>> The question is: am I going to run into huge trouble trying to implement
>>> this requirement or not?
>>>
>>> --
>>> Matteo Grolla
>>> Sourcesense - making sense of Open Source
>>> http://www.sourcesense.com
>>>
>>>
>>