Here are my addSeedDocuments and processDocuments methods, simplified
down to the minimum necessary to show what is happening:

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  // Return the same 3 docs every time, simulating an initial load and then
  // these 3 docs changing constantly.
  System.out.println(String.format("-=-= SeedTime=%s", seedTime));
  activities.addSeedDocument("100");
  activities.addSeedDocument("110");
  activities.addSeedDocument("120");
  System.out.println("SEEDING DONE");
  return null;
}
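
(For context, the real connector's addSeedDocuments has roughly the
shape sketched below. fetchDeltaChanges and DeltaResult are hypothetical
stand-ins for our delta API wrapper, not actual class names; the point
is only that a null lastSeedVersion seeds everything, and that the delta
token is returned so the framework hands it back on the next pass.)

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  // Hypothetical wrapper: a null/empty lastSeedVersion means this is the first
  // run, so it returns every document; otherwise it returns only the documents
  // changed, added, or deleted since that token.
  DeltaResult delta = fetchDeltaChanges(lastSeedVersion);
  for (String documentIdentifier : delta.getChangedAddedOrDeletedIds()) {
    activities.addSeedDocument(documentIdentifier);
  }
  // The returned token is persisted by the framework and passed back as
  // lastSeedVersion on the next seeding pass.
  return delta.getToken();
}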

@Override
public void processDocuments(String[] documentIdentifiers,
                             IExistingVersions statuses, Specification spec,
                             IProcessActivity activities, int jobMode,
                             boolean usesDefaultAuthority)
  throws ManifoldCFException, ServiceInterruption {
  System.out.println("-=--=-= PROCESS DOCUMENTS: " +
                     Arrays.deepToString(documentIdentifiers));
  // for (String documentIdentifier : documentIdentifiers) {
  //   activities.deleteDocument(documentIdentifier);
  // }

  // I've commented out all subsequent logic here, but adding a call to
  // activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd)
  // does not change anything.
}
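
(For completeness, when I do re-enable the ingestion path, the body of
the method looks roughly like the sketch below. The version string,
document URI, and content are placeholder values for this test
connector, and it assumes the usual java.io/java.nio imports.)

for (String documentIdentifier : documentIdentifiers) {
  // Placeholder version; the real connector derives this from the delta API.
  String version = "1";
  if (activities.checkDocumentNeedsReindexing(documentIdentifier, version)) {
    String documentUri = "test://" + documentIdentifier;  // hypothetical URI scheme
    RepositoryDocument rd = new RepositoryDocument();
    byte[] content = ("content for " + documentIdentifier).getBytes(StandardCharsets.UTF_8);
    rd.setBinary(new ByteArrayInputStream(content), content.length);
    try {
      activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
    } catch (IOException e) {
      throw new ManifoldCFException("IO error ingesting " + documentIdentifier, e);
    }
  }
}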

When I run this simplified code with MODEL_ADD_CHANGE_DELETE or with
MODEL_ADD_CHANGE, the output is:

-=-= SeedTime=1558733436082
-=--=-= PROCESS DOCUMENTS: [100]
-=--=-= PROCESS DOCUMENTS: [120]
-=--=-= PROCESS DOCUMENTS: [110]
-=-= SeedTime=1558733549367
-=-= SeedTime=1558733609384
etc.

 "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
never again, even though "SEEDING DONE" is printing every minute. If
and only if I uncomment the for loop which deletes the documents does
"processDocuments" get called again for those seed document ids.

I do note that the queue shows documents 100, 110, and 120 in the state
"Waiting for processing", and nothing I do seems to affect that. The
database update in JobQueue.updateExistingRecordInitial is a no-op for
these docs: their status is STATUS_PENDINGPURGATORY and the update does
not actually change anything in the database.

Regards,
Raman

On Fri, May 24, 2019 at 5:13 PM Karl Wright <daddy...@gmail.com> wrote:
>
> For any given job run, all documents that are added via addSeedDocuments()
> should be processed.  There is no magic in the framework that somehow knows
> that a document has been created vs. modified vs. deleted until
> processDocuments() is called.  If your claim is that this contract is not
> being honored, could you try changing your connector model to
> MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
> using that model.  If it does *not* then clearly you've got some kind of
> implementation problem at the addSeedDocuments() level because most of the
> Manifold connectors use that model.
>
> If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> out why MODEL_ADD_CHANGE_DELETE is failing.
>
> Karl
>
>
> On Fri, May 24, 2019 at 5:06 PM Raman Gupta <rocketra...@gmail.com> wrote:
>
> > On Fri, May 24, 2019 at 4:41 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> > > that you have to include *at least* the documents that were changed,
> > added,
> > > or deleted since the previous stamp, and if no stamp is provided, it
> > should
> > > return ALL specified documents.  Are you doing that?
> >
> > Yes, the delta API gives us all the changed, added, and deleted
> > documents, and those are exactly the ones that we are including.
> >
> > > If you are, the next thing to look at is the computation of the version
> > > string.  The version string is what is used to figure out if a change
> > took
> > > place.  You need this IN ADDITION TO the addSeedDocuments() doing the
> > right
> > > thing.  For deleted documents, obviously the processDocuments() should
> > call
> > > the activities.deleteDocument() method.
> >
> > The version String is calculated by `processDocuments`. However, after
> > calling `addSeedDocuments` once for document A version 1,
> > `processDocuments` is never called again for that document, even
> > though it has since been modified to version 2. Therefore, our
> > connector never gets a chance to return the "version 2" string.
> >
> > > Does this sound like what your code is doing?
> >
> > Yes, as far as we can go given the fact that `processDocuments` is
> > only called once for any particular document identifier.
> >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <rocketra...@gmail.com>
> > wrote:
> > >
> > > > My team is creating a new repository connector. The source system has
> > > > a delta API that lets us know of all new, modified, and deleted
> > > > individual folders and documents since the last call to the API. Each
> > > > call to the delta API provides the changes, as well as a token which
> > > > can be provided on subsequent calls to get changes since that token
> > > > was generated/returned.
> > > >
> > > > What is the best approach to building a repo connector to a system
> > > > that has this type of delta API?
> > > >
> > > > Our first design was an implementation that specifies
> > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > >
> > > > * In addSeedDocuments, on the initial call we seed every document in
> > > > the source system. On subsequent calls, we use the delta API to seed
> > > > every added, modified, or deleted file. We return the delta API token
> > > > as the version value of addSeedDocuments, so that it can be used on
> > > > subsequent calls.
> > > >
> > > > * In processDocuments, we do the usual thing for each document
> > identifier.
> > > >
> > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > never triggered for modified and deleted docs.
> > > >
> > > > A second design we are considering is to use
> > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > > > one "virtual" document, which represents the root of the remote repo.
> > > >
> > > > Then, in "processDocuments" the new "document" is used to determine
> > > > all the child documents of that delta call, which are then added to
> > > > the queue via `activities.addDocumentReference`. To force the "virtual
> > > > seed" to trigger processDocuments again on the next call to
> > > > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > > > well.
> > > >
> > > > With this alternative design, the stage 1 seed effectively becomes a
> > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > >
> > > > Thoughts?
> > > >
> > > > Regards,
> > > > Raman Gupta
> > > >
> >
