So MODEL_ADD_CHANGE does not work for you, eh?

You were saying that addSeedDocuments is being called every minute,
correct?  It sounds to me like you are running this job in continuous
crawl mode.  Can you try running the job in non-continuous mode, and just
repeating the job run once it completes?

The reason I ask is that continuous crawling handles already-crawled
documents in its own way.  It uses "exponential backoff" to schedule the
next crawl of each document, which is probably why you see the documents
in the queue but not being processed; you simply haven't waited long
enough.
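
Schematically, the rescheduling works something like this (illustrative
only, not the actual framework code):

  // each pass that finds a document unchanged pushes its next crawl
  // further out
  long interval = baseInterval;        // e.g., starts around a minute
  long nextCrawlTime = now + interval;
  // ...next pass, document still unchanged...
  interval = interval * 2;             // exponential backoff
  nextCrawlTime = now + interval;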

Karl


On Fri, May 24, 2019 at 5:36 PM Raman Gupta <rocketra...@gmail.com> wrote:

> Here are my addSeedDocuments and processDocuments methods, simplified
> down to the minimum necessary to show what is happening:
>
> @Override
> public String addSeedDocuments(ISeedingActivity activities, Specification spec,
>                                String lastSeedVersion, long seedTime, int jobMode)
>   throws ManifoldCFException, ServiceInterruption
> {
>   // return the same 3 docs every time, simulating an initial load and
>   // then these 3 docs changing constantly
>   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
>   activities.addSeedDocument("100");
>   activities.addSeedDocument("110");
>   activities.addSeedDocument("120");
>   System.out.println("SEEDING DONE");
>   return null;
> }
>
> @Override
> public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
>                              Specification spec, IProcessActivity activities,
>                              int jobMode, boolean usesDefaultAuthority)
>   throws ManifoldCFException, ServiceInterruption {
>   System.out.println("-=--=-= PROCESS DOCUMENTS: " + Arrays.deepToString(documentIdentifiers));
>   // for (String documentIdentifier : documentIdentifiers) {
>   //   activities.deleteDocument(documentIdentifier);
>   // }
>
>   // I've commented out all subsequent logic here, but adding the call to
>   // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
>   // does not change anything
> }
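>
> For reference, the fuller version of that loop (still simplified;
> computeVersionString and the documentUri scheme below are hypothetical
> stand-ins for our real code) is roughly:
>
>   for (String documentIdentifier : documentIdentifiers) {
>     // version string derived from the source system (hypothetical helper)
>     String version = computeVersionString(documentIdentifier);
>     if (activities.checkDocumentNeedsReindexing(documentIdentifier, version)) {
>       RepositoryDocument rd = new RepositoryDocument();
>       // ... populate rd with content and metadata ...
>       String documentUri = "myrepo://" + documentIdentifier;
>       activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
>     }
>   }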
>
> When I run this code with MODEL_ADD_CHANGE_DELETE or with
> MODEL_ADD_CHANGE, the output is:
>
> -=-= SeedTime=1558733436082
> -=--=-= PROCESS DOCUMENTS: [100]
> -=--=-= PROCESS DOCUMENTS: [120]
> -=--=-= PROCESS DOCUMENTS: [110]
> -=-= SeedTime=1558733549367
> -=-= SeedTime=1558733609384
> -=-= SeedTime=1558733436082
> etc.
>
>  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> never again, even though "SEEDING DONE" is printing every minute. If
> and only if I uncomment the for loop which deletes the documents does
> "processDocuments" get called again for those seed document ids.
>
> I do note that the queue shows documents 100, 110, and 120 in state
> "Waiting for processing", and nothing I do seems to affect that. The
> database update in JobQueue.updateExistingRecordInitial is a no-op for
> these docs, as their status is STATUS_PENDINGPURGATORY and the update
> does not actually change anything in the db.
>
> Regards,
> Raman
>
> On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
> >
> > For any given job run, all documents that are added via
> > addSeedDocuments() should be processed.  There is no magic in the
> > framework that somehow knows that a document has been created vs.
> > modified vs. deleted until processDocuments() is called.  If your claim
> > is that this contract is not being honored, could you try changing your
> > connector model to MODEL_ADD_CHANGE, just temporarily, to see if
> > everything seems to work using that model?  If it does *not*, then
> > clearly you've got some kind of implementation problem at the
> > addSeedDocuments() level, because most of the Manifold connectors use
> > that model.
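> >
> > The model is just the return value of getConnectorModel() in your
> > connector class, e.g.:
> >
> >   @Override
> >   public int getConnectorModel()
> >   {
> >     return MODEL_ADD_CHANGE;
> >   }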
> >
> > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> > out why MODEL_ADD_CHANGE_DELETE is failing.
> >
> > Karl
> >
> >
> > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > > >
> > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically
> > > > says that you have to include *at least* the documents that were
> > > > changed, added, or deleted since the previous stamp, and if no stamp
> > > > is provided, it should return ALL specified documents.  Are you
> > > > doing that?
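> > > >
> > > > In sketch form (deltaApi and DeltaResult here are placeholders for
> > > > whatever your source system provides):
> > > >
> > > >   DeltaResult delta;
> > > >   if (lastSeedVersion == null) {
> > > >     // no stamp: seed ALL documents covered by the specification
> > > >     delta = deltaApi.listAll();
> > > >   } else {
> > > >     // stamp present: seed at least everything added, changed,
> > > >     // or deleted since that stamp
> > > >     delta = deltaApi.changesSince(lastSeedVersion);
> > > >   }
> > > >   for (String id : delta.documentIds())
> > > >     activities.addSeedDocument(id);
> > > >   // the returned token becomes lastSeedVersion on the next call
> > > >   return delta.token();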
> > >
> > > Yes, the delta API gives us all the changed, added, and deleted
> > > documents, and those are exactly the ones that we are including.
> > >
> > > > If you are, the next thing to look at is the computation of the
> > > > version string.  The version string is what is used to figure out if
> > > > a change took place.  You need this IN ADDITION TO the
> > > > addSeedDocuments() doing the right thing.  For deleted documents,
> > > > obviously the processDocuments() should call the
> > > > activities.deleteDocument() method.
> > >
> > > The version string is calculated by `processDocuments`. But after
> > > `addSeedDocuments` is called once for document A version 1,
> > > `processDocuments` is never called again for that document, even
> > > though it has been modified to version 2. Therefore, our connector
> > > never gets a chance to return the "version 2" string.
> > >
> > > > Does this sound like what your code is doing?
> > >
> > > Yes, as far as we can, given that `processDocuments` is only called
> > > once for any particular document identifier.
> > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com> wrote:
> > > >
> > > > > My team is creating a new repository connector. The source system
> > > > > has a delta API that lets us know of all new, modified, and
> > > > > deleted individual folders and documents since the last call to
> > > > > the API. Each call to the delta API provides the changes, as well
> > > > > as a token which can be provided on subsequent calls to get
> > > > > changes since that token was generated/returned.
> > > > >
> > > > > What is the best approach to building a repo connector to a system
> > > > > that has this type of delta API?
> > > > >
> > > > > Our first design was an implementation that specifies
> > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > >
> > > > > * In addSeedDocuments, on the initial call we seed every document
> > > > > in the source system. On subsequent calls, we use the delta API
> > > > > to seed every added, modified, or deleted file. We return the
> > > > > delta API token as the version value of addSeedDocuments, so that
> > > > > it can be used on subsequent calls.
> > > > >
> > > > > * In processDocuments, we do the usual thing for each document
> > > > > identifier.
> > > > >
> > > > > In our prototype, this works for new docs, but processDocuments
> > > > > is never triggered for modified or deleted docs.
> > > > >
> > > > > A second design we are considering is to use
> > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return
> > > > > only one "virtual" document, which represents the root of the
> > > > > remote repo.
> > > > >
> > > > > Then, in processDocuments, the new "document" is used to
> > > > > determine all the child documents of that delta call, which are
> > > > > then added to the queue via `activities.addDocumentReference`. To
> > > > > force the "virtual seed" to trigger processDocuments again on the
> > > > > next call to `addSeedDocuments`, we do
> > > > > `activities.deleteDocument(virtualDocId)` as well, as sketched
> > > > > below.
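> > > > >
> > > > > Roughly, our processDocuments for this design looks like the
> > > > > following (VIRTUAL_ROOT_ID and fetchDeltaChildren are
> > > > > placeholders for our real code):
> > > > >
> > > > >   for (String id : documentIdentifiers) {
> > > > >     if (VIRTUAL_ROOT_ID.equals(id)) {
> > > > >       // queue every child reported by the delta call
> > > > >       for (String childId : fetchDeltaChildren())
> > > > >         activities.addDocumentReference(childId);
> > > > >       // delete the virtual doc so the next seeding pass
> > > > >       // re-processes it
> > > > >       activities.deleteDocument(id);
> > > > >     } else {
> > > > >       // normal per-document processing
> > > > >     }
> > > > >   }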
> > > > >
> > > > > With this alternative design, the stage 1 seed effectively
> > > > > becomes a no-op, and is just used as a mechanism to trigger
> > > > > stage 2.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Regards,
> > > > > Raman Gupta
> > > > >
> > >
>
